Network Working Group A. Clemm

Internet-Draft L. Dong

Intended status: Informational Futurewei

Expires: January 12, 2023 G. Mirsky

Ericsson

L. Ciavaglia

Rakuten Mobile

J. Tantsura

Microsoft

M-P. Odini

July 11, 2022

Green Networking Metrics

draft-cx-green-metrics-00

Abstract

This document explains the need for network instrumentation that

allows the power consumption, energy efficiency, and carbon footprint

associated with a network, its equipment, and the services provided

over it to be assessed. It also suggests a set of related metrics

that, when made visible, can help to optimize a network’s energy

efficiency and "greenness".

Status of This Memo

This Internet-Draft is submitted in full conformance with the

provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering

Task Force (IETF). Note that other groups may also distribute

working documents as Internet-Drafts. The list of current Internet-

Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months

and may be updated, replaced, or obsoleted by other documents at any

time. It is inappropriate to use Internet-Drafts as reference

material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 12, 2023.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust’s Legal

Provisions Relating to IETF Documents

(https://trustee.ietf.org/license-info) in effect on the date of

publication of this document. Please review these documents

carefully, as they describe your rights and restrictions with respect

to this document. Code Components extracted from this document must

include Simplified BSD License text as described in Section 4.e of

the Trust Legal Provisions and are provided without warranty as

described in the Simplified BSD License.

Table of Contents

1. Introduction

2. Definitions and Acronyms

3. Energy Metrics

3.1. Energy Metrics related to Equipment

3.1.1. Base Metrics

3.1.2. Virtualization Considerations

3.2. Energy Metrics related to Flows

3.3. Energy Metrics related to Paths

3.4. Energy Metrics related to the Network-at-Large

4. Other considerations and discussion items

4.1. User perspective

4.2. Holistic perspective

4.3. Sustainable equipment production

5. IANA Considerations

6. Security Considerations

7. Acknowledgments

8. Informative References

Authors’ Addresses

1. Introduction

Climate change and the need to curb greenhouse emissions have been

recognized by the United Nations and by most governments as one of

the big challenges of our time. As a result, improving energy

efficiency and reducing power consumption are becoming of increasing

importance for society and for many industries. The networking

industry is no exception.

Networks themselves consume significant amounts of energy.

Therefore, the networking industry has an important role to play in

meeting sustainability goals. Future networking advances will

increasingly need to focus on becoming more energy-efficient and

reducing carbon footprint, both for economic reasons and for reasons

of corporate responsibility. This shift has already begun and

sustainability is already becoming an important concern for network

providers [telefonica2020].

There are many vectors along which networks can be made "greener".

At the foundation is the network equipment itself. Making such

equipment more energy-efficient is a big factor in helping networks

become greener. However, opportunities also exist at the level of

protocols themselves (e.g. reduction of transmission waste and

enabling of rapid control loops), at the level of the overall network

(e.g. path optimization under consideration of energy efficiency as a

cost factor), and at the architecture level (e.g. placement of content and

functions) [I.D.draft-cwx-green-ps].

However, regardless of any particular approach that is chosen, in

order to assess its impact, there is a need to have visibility into

the actual energy consumption that is occurring and to ideally be

able to attribute that consumption to its sources. As the adage

goes, you cannot manage what you cannot measure. By extension, you

cannot optimize what you have no visibility into. The ability to

instrument networks in a way that allows for the assessment of energy

consumption is hence an important enabler for potential energy

optimizations, making it possible to assess the effectiveness of measures that

are being taken and enabling (for example) control loops that involve

energy as an input. Before instrumenting, it needs to be clear,

however, what the proper metrics are that network providers will be

interested in and that applications will seek to optimize.

This document defines a set of metrics that make it possible to assess the

"greenness" of networks and that form the basis for optimizing energy

efficiency, carbon footprint, and environmental sustainability of

networks and the services provided. These metrics are intended to

serve as the foundation for possible later IETF standardization

activities, such as the definition of related YANG modules [RFC7950]

or energy-related control protocol extensions.

Please note that throughout this document, we will be using the terms

"green" and "energy efficient" interchangeably. In general, we will

use these terms in a broad sense, also encompassing carbon

footprint and sustainability except when explicitly mentioned

otherwise. Likewise, we treat "energy efficiency" as synonymous with

"energy utilization efficiency", broadly speaking referring to the

efficiency with which energy is being utilized.

2. Definitions and Acronyms

Carbon footprint: as used in this document, the amount of carbon

emissions associated with the use or deployment of technology,

usually directly correlated with the associated energy consumption

CPU: Central Processing Unit

IPFIX: IP Flow Information eXport

TCAM: Ternary Content-Addressable Memory

pWh: pico Watt hour

Wh: Watt hour

3. Energy Metrics

We categorize energy metrics as follows:

o At the device/equipment level. This concerns aspects such as

energy consumption of a device as a whole, of equipment components

such as line cards or individual ports. It includes metrics that

would, for example, be found in equipment data sheets.

o At the flow level. This concerns aspects about energy consumption

by flows. Metrics at this level attribute energy consumption to a

o At the path level. These metrics attest to the end-to-end energy

efficiency of paths, attesting to their energy intensity

(reflecting e.g. the amount of energy drawn when the path is

selected) and taking into account, for example, whether a given

path includes segments known to be energy-intensive.

o At the network level. These metrics aggregate energy consumption

across a network to provide a holistic picture of the "network as

a system".

3.1. Energy Metrics related to Equipment

3.1.1. Base Metrics

Arguably the most relevant energy metrics relate to equipment as a

whole. After all, power is drawn from devices.

The power consumption of the device can be divided into the

consumption of the core components (e.g. the backplane and CPU) as

well as additional consumption incurred per port and line card. In

[I.D.draft-manral-bmwg-power-usage], the device factors affecting

power consumption are summarized: base chassis power, number of line

cards, number of active ports, port settings, port utilization,

implementation of packet classification of Ternary Content-

Addressable Memory (TCAM) and the size of TCAM, firmware version.

Furthermore, it is important to understand the difference between

power consumption when a resource is idling versus when it is under

load. This helps to understand the incremental cost of additional

transmission versus the initial cost of transmission. Generally, the

cost of the first bit could be considered very high, as it requires

powering up a device, port, etc. The cost of transmission of

additional bits (beyond the first) is many orders of magnitude lower.

Likewise, the incremental cost of CPU and memory that will be needed

to process additional packets becomes fairly negligible.

The first set of metrics corresponds to ratings of the device:

o Power consumption when idle (e.g. Watts)

o Power consumption when fully loaded (e.g. Watts)

o Power consumption at various loads: e.g. 50% utilization, 90%

utilization

These metrics should be maintained for the device as a whole, and for

the subcomponents: i.e. for the chassis by itself, for each line

card, for each port. They should also take into account aspects such

as the current memory configuration, as the overall energy

consumption of a device is a function of the energy consumption of

the components the system is comprised of.
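
As an illustration only, the rated load points listed above could be
used to estimate the power draw at an intermediate utilization. The
following is a minimal sketch, assuming linear interpolation between
rated points. The function name and the rating values are hypothetical
and are not taken from any data sheet.

```python
from bisect import bisect_left

# Hypothetical rating points for one device: (utilization fraction, Watts).
# The list is assumed to be sorted and to include both 0.0 (idle) and
# 1.0 (fully loaded). The values are illustrative only.
RATINGS = [(0.0, 300.0), (0.5, 340.0), (0.9, 370.0), (1.0, 380.0)]


def estimated_power_watts(utilization, ratings=RATINGS):
    """Linearly interpolate power draw between the nearest rated points."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    loads = [load for load, _ in ratings]
    i = bisect_left(loads, utilization)
    if loads[i] == utilization:
        # Exact match with a rated load point.
        return ratings[i][1]
    (l0, p0), (l1, p1) = ratings[i - 1], ratings[i]
    return p0 + (p1 - p0) * (utilization - l0) / (l1 - l0)
```

The same lookup could be applied per chassis, line card, or port and
the results summed, provided that ratings exist at each of those
levels.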

The metrics could be provided by the data sheet associated with the

device or they could be measured as part of a deployment. For

maximum accuracy and comparability, they should reflect pre-defined

environmental settings, e.g., operating temperature, relative

humidity, barometric pressure. For example, ATIS (Alliance for

Telecommunications Industry Solutions) [ATIS0600015.02] defines a

reference environment under which to measure router power

consumption: temperature of 25 degrees Celsius (within 3 degrees

Celsius deviation), relative humidity of 30% to 75%, barometric

pressure between 1020 and 812 mbar. In the AC power configuration,

the router should be evaluated at 230 VAC or within 1% deviation, 50

or 60 Hz or within 1% deviation [Ahn2014].

The second set of metrics reflects the actual power being drawn

during operation. It is the type of data that might be provided as

management data. Again, it should be provided for the device as a

whole, as well as for the subcomponents reflected in the device

hierarchy: line cards, ports, etc.

o Current power consumption (e.g. Watts)

o Energy consumed since system start (or module insertion, if at the

level of a line card, or port activation, if at the level of a

port), and over the past minute (e.g. Watt hours)

The third set of metrics is derived from the earlier metrics. They

normalize the power consumption relative to the line speed or the

amount of traffic that is passed.

o Current power consumption / kilooctet
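
The normalization above can be sketched as follows. This is a minimal
illustration, assuming that the current power draw and the traffic
rate are both available as management data. The function name and the
sample values are hypothetical.

```python
def energy_per_kilooctet_joules(power_watts, octets_per_second):
    """Energy spent per kilooctet of traffic at the current operating point.

    Watts are Joules per second, so dividing by kilooctets per second
    yields Joules per kilooctet.
    """
    if octets_per_second <= 0:
        raise ValueError("metric is undefined when no traffic is passed")
    return power_watts / (octets_per_second / 1000.0)


# Hypothetical port drawing 350 W while passing 10 Gbit/s,
# i.e. 1.25e9 octets per second.
sample = energy_per_kilooctet_joules(350.0, 1.25e9)
```

Because idle power is spread over fewer octets, this value rises
sharply at low utilization, which is why it should be read together
with the idle and loaded ratings.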

The fourth set of metrics reflects expectation values about

incremental energy usage. It could be relevant for use cases that

assess the cost of additional traffic. [Bolla2011] and [Ahn2014]

found that the power consumption of a router is directly proportional

to the link utilization as well as the packet sizes.

o Incremental power per packet, per kilooctet, per gigaoctet.

(Possible units might be pWh - pico Watt hours)
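
To illustrate the incremental metric, the following minimal sketch
derives an incremental energy per kilooctet from two operating points
and expresses it in pWh. It assumes that power grows linearly with
traffic between the two points, which is a simplification. The
function names and the sample values are hypothetical.

```python
JOULES_PER_PWH = 3600.0 * 1e-12  # 1 Wh = 3600 J and 1 pWh = 1e-12 Wh


def incremental_joules_per_kilooctet(p_low_w, p_high_w, rate_low, rate_high):
    """Extra energy per additional kilooctet between two operating points.

    Powers are in Watts; rates are in octets per second.
    """
    if rate_high <= rate_low:
        raise ValueError("rate_high must exceed rate_low")
    return (p_high_w - p_low_w) / (rate_high - rate_low) * 1000.0


def joules_to_pwh(joules):
    """Convert Joules to pico Watt hours."""
    return joules / JOULES_PER_PWH


# Hypothetical device: 300 W when idle, 380 W at 1.25e9 octets per second.
per_kilooctet_j = incremental_joules_per_kilooctet(300.0, 380.0, 0.0, 1.25e9)
per_kilooctet_pwh = joules_to_pwh(per_kilooctet_j)
```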

In addition to these metrics, it is conceivable to also have the

device reflect other context of relevance, such as the sustainability

rating of the power source. This could potentially be reflected

along a scale ranging from diesel-generator powered, via conventional

power grid, to renewable (powered by windmill, capture of excess

heat, etc). Also, the environmental status of the device could be

taken into consideration, such as whether it is deployed in a data

center and its share in contributing to the need for cooling. It is

conceivable to, for example, introduce corresponding metrics

indicating a "green rating" of a device, and/or of the context in which

a device has been deployed.

3.1.2. Virtualization Considerations

Instrumentation should also take into account the possibility of

virtualization. This is important in particular as networking

functions may increasingly be virtualized and hosted (for example) in

a data center. Overlay networks may be formed. Likewise, many

applications expected to optimize energy consumption may be hosted on

controllers and applied to soft switches, VNFs (Virtual Network

Functions), or networking slices. The attribution of actual power

consumed to such virtualized entities is a non-trivial task. It

involves navigating layers of indirection to assess actual energy

usage and contribution by individual entities. While it would be

possible in such cases to simply revert to energy metrics of CPUs and

data centers as a whole, this loses the ability to account for those

metrics on the basis of networking decisions being made.

For example, virtualized networking functions could be hosted on

containers or virtual machines which are hosted on a CPU in a data

center instead of a regular network appliance such as a router or a

switch, leading to very different power consumption characteristics.

A data center CPU could be more power efficient and consume power

more proportionally to actual CPU load. Virtualization could result

in using fewer servers. [Energystar] reports that one watt-hour of

energy savings at the server level results in roughly 1.9 watt-hours

of facility-level energy savings by reducing energy waste in the

power infrastructure and reducing energy needed to cool the waste

heat produced by the server.

Instrumentation needs to reflect these facts and facilitate

attributing power consumption in a correct manner. Alternatively, a

simpler solution may be to simply forgo energy metrics for

virtualized functions entirely and instead to focus on instrumenting

and optimizing the energy footprint of the underlying hosting

infrastructure. In the meantime, the attribution of energy

consumption and carbon footprint to individual functions that run on

top of that infrastructure may be a topic for further research.

3.2. Energy Metrics related to Flows

Energy metrics related to flows attempt to capture the contribution

of a given flow to energy consumption. In its basic incarnation,

those metrics reflect the energy consumption at a given device. They

could be used in conjunction with IPFIX [RFC7011] and modeled as

Information Elements to be treated analogous to other flow statistics

[RFC7012]. The following is a corresponding set of flow energy

metrics:

o Incremental energy consumed over the duration of the flow.

This is the incremental energy consumption that is directly caused

by the flow, representing the difference between the amount of

energy consumed with the flow and the amount of energy that would

have been consumed without the flow. (It should be noted that

this metric may be difficult to assess in practice.)

o Amortized energy consumed over the duration of the flow.

This is the portion of the flow’s energy consumption for the

duration of the flow, effectively computed by computing the

proportion of flow traffic to overall traffic and multiplying it

with the total energy consumption incurred for that time.
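
The two flow metrics above can be contrasted with a minimal sketch. It
assumes that octet counts and energy readings for the observation
interval are available from the device. All names and numbers are
hypothetical and do not correspond to defined IPFIX Information
Elements.

```python
def amortized_flow_energy_wh(flow_octets, total_octets, total_energy_wh):
    """Share of the interval's total energy in proportion to flow traffic."""
    if total_octets <= 0:
        raise ValueError("total_octets must be positive")
    if not 0 <= flow_octets <= total_octets:
        raise ValueError("flow_octets must lie within total_octets")
    return total_energy_wh * (flow_octets / total_octets)


def incremental_flow_energy_wh(energy_with_flow_wh, energy_without_flow_wh):
    """Energy directly caused by the flow.

    The second argument is a counterfactual estimate, which is why this
    metric may be difficult to assess in practice.
    """
    return energy_with_flow_wh - energy_without_flow_wh


# Hypothetical interval: the device used 5 Wh while passing 1e11 octets,
# of which 2e9 octets belonged to the flow of interest.
amortized = amortized_flow_energy_wh(2e9, 1e11, 5.0)
incremental = incremental_flow_energy_wh(5.0, 4.999)
```

In this made-up example the amortized share is far larger than the
incremental contribution, reflecting the high fixed cost of keeping
the device powered.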

A second set of energy metrics related to flow might aggregate the

flow’s energy consumption over the entire flow path. In that case,

the flow energy consumption is added up along the systems of the

traversed path. In practice, this will be more difficult to assess

for many reasons, including impacts of load balancing, PREOF (Packet

Replication, Elimination, and Ordering Functions [RFC8655]),

challenges to trace actual routes taken by production traffic, and

3.3. Energy Metrics related to Paths

Energy metrics related to paths involve assessing the carbon

footprints of paths and optimizing those paths so that overall

footprint is minimized, then applying techniques such as path-aware

networking [I.D.draft-chunduri-rtgwg-preferred-path-routing] or

segment routing [RFC8402] to steer traffic along those paths that are

deemed "the greenest" among alternatives. It also includes aspects

such as considering the incremental energy usage in routing

decisions.

Optimizing cost has a long tradition in networking; many of the

existing mechanisms can be leveraged for greener networking simply by

introducing energy footprint as a cost factor. Low-hanging fruit

include the inclusion of energy-related parameters as a cost

parameter in control planes, whether distributed (e.g. IGP) or

conceptually centralized via SDN controllers. In addition to power

consumption over a path itself, other factors such as paths involving

intermediate routers that are powered by renewable energy resources

might be considered, as might be determined by an aggregate

sustainability score. After all, paths with devices that are powered

by solar, wind, or geothermal might be preferable over paths

involving devices powered by conventional energy that may include

fossil fuel or nuclear resources.

The following is a corresponding set of candidate metrics:

o Energy rating of a path. (This could be computed as a function of

energy ratings of different hops along the path.)

o Current power consumption across a path. (This could be computed

by aggregating the current power per packet (or per kilooctet

etc) of each of the hops along the path.)

o Incremental power for a packet over a path. (This could be

computed by aggregating the incremental power per packet of each

of the hops along the path.)
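
The per-hop aggregation described in these candidate metrics can be
sketched as follows. This is a minimal illustration, assuming that
each hop exposes an incremental per-packet value in pWh and that a
simple sum is an acceptable aggregate. The hop names, values, and
function names are hypothetical.

```python
def path_incremental_pwh(path, per_hop_pwh):
    """Sum the per-packet incremental energy of every hop on the path."""
    return sum(per_hop_pwh[hop] for hop in path)


def greenest_path(paths, per_hop_pwh):
    """Return the candidate path with the lowest aggregate value."""
    return min(paths, key=lambda path: path_incremental_pwh(path, per_hop_pwh))


# Hypothetical per-hop incremental energy per packet, in pWh.
per_hop_pwh = {"A": 40.0, "B": 25.0, "C": 60.0, "D": 10.0}
candidates = [("A", "B", "D"), ("A", "C", "D")]
best = greenest_path(candidates, per_hop_pwh)
```

A control plane could use the same aggregate as one cost factor among
others, rather than as the sole selection criterion.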

3.4. Energy Metrics related to the Network-at-Large

Ultimately, the goal of energy optimization and reduction of carbon

footprint is to minimize the aggregate amount of energy used across

the entire network, as well as to minimize the overall carbon

footprint of the network as a whole. Accordingly, metrics that

aggregate the energy usage across the network as a whole are needed.

In order to account for changing traffic profiles, growth in user

traffic etc, additional metrics are needed that normalize the total

over the volume of services supported and volume of traffic passed.

Corresponding metrics will generally be computed at the level of

Operational Support Systems (or Business Support Systems) for the

entire network.

Some of the metrics used include the following [telefonica2020]:

o Total energy consumption (MWh)

o Electricity from renewable sources (%)

o Network energy efficiency (MWh/PB)
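
As an illustration of how these network-level figures relate to each
other, the following minimal sketch computes the normalized efficiency
and the renewable share from aggregate totals. The function names and
input values are hypothetical and are not taken from any operator
report.

```python
def network_energy_efficiency_mwh_per_pb(total_energy_mwh, traffic_pb):
    """Total energy normalized over the volume of traffic passed."""
    if traffic_pb <= 0:
        raise ValueError("traffic_pb must be positive")
    return total_energy_mwh / traffic_pb


def renewable_share_percent(renewable_mwh, total_energy_mwh):
    """Percentage of total electricity that came from renewable sources."""
    if total_energy_mwh <= 0:
        raise ValueError("total_energy_mwh must be positive")
    return 100.0 * renewable_mwh / total_energy_mwh


# Hypothetical yearly totals for one network.
efficiency = network_energy_efficiency_mwh_per_pb(5000.0, 100.0)
renewable = renewable_share_percent(4000.0, 5000.0)
```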

4. Other considerations and discussion items

This document is intended to spark discussion about what energy

metrics will be useful to reduce the carbon footprint of networks:

metrics that provide visibility into energy consumption, that help

optimize networks under green criteria, and that enable the next

generation of energy-aware controllers and services. Clearly, other

metrics are conceivable and more considerations apply beyond those

that are currently reflected in this document. The following

subsections highlight items that warrant further discussion and that

might be addressed in greater detail in future revisions of this

document.

4.1. User perspective

Arguably, attributing energy usage to individual users and making

users aware of the energy-implications of their communication

behavior may provide interesting possibilities to reduce energy

footprint by guiding their behavior accordingly. For example, the

network could present clients with energy statistics related to their

communication usage. This could be supported by metrics related to

service instances, such as energy usage statistics beyond statistics

regarding volume, duration, number of transactions. Such approaches

would raise questions about how to actually collect such statistics

accurately (versus just computing them via a formula) or what to

actually include as part of those statistics (amortized vs

incremental energy contribution, attribution of cost for path

resilience or retransmissions due to congestion, etc). They also

raise questions about how they would in practice be used. For

example, energy-based charging might be explored as an alternative

to volume-based charging; however, in practice the two may be

strongly correlated, and energy-based charging may be rejected by

customers for the same reasons that volume-based charging frequently is.

4.2. Holistic perspective

The network itself is only one contributor to a network’s carbon

footprint. Arguably just as important are aspects outside the

network itself, such as cooling and ventilation. These aspects need

to be considered. However, reflecting such aspects here would

arguably result in "boiling the ocean" and are therefore not

addressed here.

4.3. Sustainable equipment production

Internet energy consumption may be divided into two major components

[Raghavan2011]: (1) the energy of the devices that make up the

Internet, including the infrastructure devices: routers, LAN devices,

cellular and telecommunication infrastructure; and (2) more broadly,

with the rise of peer-to-peer applications and cloud services, the

energy consumption of the end systems, including desktops, laptops,

smart phones, cloud servers, and application servers that are not in

the cloud.

For those two components, the following factors need to be taken into

consideration for energy consumption calculation:

o Energy consumed in manufacturing of the devices and end-systems,

as well as the contribution from their components and materials.

o The replacement lifespan of the devices and end-systems: desktops

and laptops are typically replaced in 3-4 years, smartphones in 2

years, application servers and cloud servers in 3 years, routers

and WiFi-LAN switches in 3 years, cellular towers and

telecommunication switches in 10 years, fiber optics in 10 years,

copper in 30 years, etc. As the pace of technological advancement

increases, the replacement lifespan might decrease over time.

o Operational maintenance: the network would not be functional

without various software and implementation of protocols. The

energy consumed in creating software is difficult to quantify because

software creation overwhelmingly involves human effort; it usually

includes the energy used by the facilities of the software companies

and the human energy of the programmers.

o Replacement: The energy consumed in replacement of devices and

end-systems could vary. It could be very energy intensive for large

devices, e.g., cellular towers, or for environmentally unfriendly

equipment, such as submarine communication cables.

o Disposal: There is substantial energy cost in disposing and

recycling the old devices and equipment.

By combining the energy consumption for running each device that

builds the Internet [JuniperRouterPower], and the energy consumption

of the end systems, while also counting the energy consumption

of manufacturing, operational maintenance, replacement and lifespan,

disposal of those devices and equipment, we may have an estimate of

the energy consumption for the network as a whole.
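
One way to combine these factors, shown purely as an illustration, is
to amortize the one-time energy costs of a device over its replacement
lifespan and add the result to its annual operating energy. The sketch
below assumes this simple additive model and leaves out the energy of
operational maintenance, which is harder to quantify. The class, the
field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeviceLifecycle:
    operating_kwh_per_year: float  # energy for running the device
    manufacturing_kwh: float       # one-time, incl. components and materials
    replacement_kwh: float         # one-time, energy spent on replacement
    disposal_kwh: float            # one-time, disposal and recycling
    lifespan_years: float          # replacement lifespan

    def annualized_kwh(self):
        """Operating energy plus one-time costs spread over the lifespan."""
        one_time = (self.manufacturing_kwh + self.replacement_kwh
                    + self.disposal_kwh)
        return self.operating_kwh_per_year + one_time / self.lifespan_years


def network_annual_kwh(devices):
    """Aggregate the annualized energy of all devices and end systems."""
    return sum(device.annualized_kwh() for device in devices)


# Hypothetical devices, both with a 3-year replacement lifespan.
router = DeviceLifecycle(3000.0, 1500.0, 200.0, 100.0, 3.0)
switch = DeviceLifecycle(1000.0, 600.0, 100.0, 50.0, 3.0)
total = network_annual_kwh([router, switch])
```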

5. IANA Considerations

This document does not have any IANA requests.

6. Security Considerations

When instrumenting a network for energy metrics, it is important that

implementations are secured to ensure that data is accurately

measured and cannot be tampered with. For example, an attacker might

try to tamper with energy readings to confuse a controller trying to

minimize power consumption, leading to increased power consumption instead.

In addition, access to the data needs to be secured in similar ways

as for other sensitive management data, for example using secure

management protocols and subjecting energy data that is maintained in

YANG datastores to NACM (NETCONF Access Control Model).

However, it should be noted that this draft specifies only metrics

themselves, not how to instrument networks accordingly. For the

definition of metrics themselves, security considerations thus do not

really apply.

7. Acknowledgments

Acknowledgments will be added when the time comes.

8. Informative References

[Ahn2014] Ahn, J. and H. S. Park, "Measurement and modeling the

power consumption of router interface",

DOI: 10.1109/ICACT.2014.6779082, 16th International

Conference on Advanced Communication Technology, pp.

860-863, 2014,

<https://ieeexplore.ieee.org/document/6779082>.

[ATIS0600015.02]

ATIS, "Energy Efficiency for Telecommunication Equipment:

Methodology for Measurement and Reporting - Transport and

Optical Access Requirements", March 2016.

[Bolla2011]

Bolla, R., Bruschi, R., Lombardo, C., and D. Suino,

"Evaluating the energy-awareness of future Internet

devices", DOI: 10.1109/HPSR.2011.5986001, 2011 IEEE 12th

International Conference on High Performance Switching and

Routing, pp. 36-43, 2011,

[Energystar]

EnergyStar, "12 Ways to Save Energy in the Data Center,

Server Virtualization", 2022,

<https://www.energystar.gov/products/

low_carbon_it_campaign/12_ways_save_energy_data_center/

server_virtualization>.

[I.D.draft-chunduri-rtgwg-preferred-path-routing]

Bryant, S. E., Chunduri, U., and A. Clemm, "Preferred Path

Routing Framework", May 2022,

<https://datatracker.ietf.org/doc/html/draft-chunduri-

rtgwg-preferred-path-routing-01>.

[I.D.draft-cwx-green-ps]

Clemm, A. and C. Westphal, "Challenges and Opportunities

in Green Networking", June 2022.

[I.D.draft-manral-bmwg-power-usage]

Manral, V., "Benchmarking Power usage of networking

devices", Jan 2011.

[JuniperRouterPower]

Juniper, "Power Requirements for an MX960 Router", 2021.

[Raghavan2011]

Raghavan, B. and J. Ma, "The energy and emergy of the

Internet", HotNets-X: Proceedings of the 10th ACM Workshop

on Hot Topics in Networks, pp. 1-6, 2011,

<https://dl.acm.org/doi/10.1145/2070562.2070571>.

[RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, "Specification

of the IP Flow Information Export (IPFIX) Protocol for the

Exchange of Flow Information", RFC 7011, September 2013,

<https://datatracker.ietf.org/doc/html/rfc7011>.

[RFC7012] Claise, B., Ed. and B. Trammell, Ed., "Information Model for IP

Flow Information Export (IPFIX)", RFC 7012, September

2013, <https://datatracker.ietf.org/doc/html/rfc7012>.

[RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",

RFC 7950, August 2016,

[RFC8402] Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,

Decraene, B., Litkowski, S., and R. Shakir, "Segment

Routing Architecture", RFC 8402, July 2018,

[RFC8655] Finn, N., Thubert, P., Varga, B., and J. Farkas,

"Deterministic Networking Architecture", RFC 8655, October

2019, <https://datatracker.ietf.org/doc/html/rfc8655>.

[telefonica2020]

Telefonica, "Telefonica Consolidated Annual Report 2020.",

Authors’ Addresses

Alexander Clemm

Futurewei

2220 Central Expressway

Santa Clara CA 95050

Email: ludwig@clemm.org

Lijun Dong

Futurewei

Santa Clara CA 95050

Email: lijun.dong@futurewei.com

Greg Mirsky

Ericsson

Email: gregimirsky@gmail.com

Laurent Ciavaglia

Rakuten Mobile

Email: laurent.ciavaglia@rakuten.com

Jeff Tantsura

Microsoft

Email: jefftant.ietf@gmail.com

Marie-Paule Odini

Email: mp.odini@orange.fr

Network Working Group A. Clemm

Internet-Draft C. Westphal

Intended status: Informational Futurewei

Expires: January 12, 2023 J. Tantsura

Microsoft

L. Ciavaglia

Rakuten Mobile

M-P. Odini

July 11, 2022

Challenges and Opportunities in Green Networking

draft-cx-green-ps-00

Abstract

Reducing technology’s carbon footprint is one of the big challenges

of our age. Networks are an enabler of applications that reduce this

footprint, but also contribute to this footprint substantially

themselves. The biggest opportunities to reduce the energy footprint

may not be networking specific, for instance general power efficiency

gains in hardware or hosting of equipment in more cooling-efficient

buildings. Yet methods to make networking technology itself

"greener" also need to be explored. This document outlines a

corresponding set of opportunities, along with associated research

challenges, for reducing this footprint and reducing network energy

demand.

Table of Contents

1. Introduction

2. Definitions and Acronyms

3. Contributors to Network Energy Consumption

4. Challenges and Opportunities - Equipment Level

5. Challenges and Opportunities - Protocol Level

5.1. Data Volume Reduction

5.2. Traffic Adaptation

5.3. Enabling Network Energy Saving Mechanisms

5.4. Network Addressing

6. Challenges and Opportunities - Network Level

7. Challenges and Opportunities - Architecture Level

8. Conclusions

11. Acknowledgments

12. Informative References (TBD)

1. Introduction

Climate change and the need to curb greenhouse emissions have been

recognized by the United Nations and by most governments as one of

the big challenges of our time. As a result, improving energy

efficiency and reducing power consumption are becoming of increasing

importance for society and for many industries. The networking

industry is no exception.

Arguably, networks can already be considered "green" technology in

that networks enable many applications that allow users and whole

industries to save energy and become more sustainable in a

significant way. For example, they make it possible (at least to an

extent) to replace travel with teleconferencing; they enable many employees to

work from home and "telecommute," thus reducing the need for actual

commute; IoT applications that facilitate automated monitoring and

control from remote sites help make agriculture more sustainable by

minimizing the application of resources such as water and fertilizer;

networked smart buildings allow for greater energy optimization and

sparser use of lighting and HVAC (heating, ventilation, air

conditioning) than their non-networked not-so-smart counterparts.

That said, networks themselves consume significant amounts of energy.

Therefore, the networking industry has an important role to play in

meeting sustainability goals not just by enabling others to reduce

their reliance on energy, but by also reducing its own. Future

networking advances will increasingly need to focus on becoming more

energy-efficient and reducing carbon footprint, both for economic

reasons and for reasons of corporate responsibility. This shift has

already begun and sustainability is already becoming an important

concern for network providers. In some cases such as in the context

of networked data centers, the ability to procure enough energy

becomes a bottleneck prohibiting further growth and greater

sustainability thus becomes a business necessity.

For example, in its annual report, Telefonica reports that in 2020,

its network’s energy consumption per PB of data amounted to 78MWh

[telefonica2020]. This rate has been dramatically decreasing (a

five-fold factor over five years) although gains in efficiency are

being offset by simultaneous growth in data volume. In the same

report, it is stated as an important corporate goal to continue on

that trajectory and reduce overall carbon emissions by 70% over the

next 5 years.

From a technical perspective, multiple vectors along which networks

can be made "greener" should be considered:

o At the equipment level. Perhaps the most promising vector for

improving networking sustainability concerns the network equipment

itself. At the most fundamental level, networks (even softwarized

ones) involve appliances, i.e. equipment that relies on electrical

power to perform its function. However, beyond making those

appliances merely energy-efficient, there are other important ways

in which equipment can help networks become greener. This

includes aspects such as support for port power saving modes

allowing to reduce power consumption for resources that are not

fully utilized, but also instrumentation that allows to precisely

monitor power usage at different levels of granularity, enabling

(for example) controller applications that aim to optimize energy

usage across the network. (As a side note, the term "device", as

used in the context of this draft, is used to refer to networking

equipment. We are not taking into consideration end-user devices

and endpoints such as mobile phones or computing equipment.)

o At the protocol level. Energy-efficiency and greenness are

aspects that are rarely considered when designing network

protocols. This suggests that there may be plenty of untapped

potential. Some aspects involve designing protocols in ways that

reduce the need for redundant or wasteful transmission of data to

allow not only for better network utilization, but greater goodput

per unit of energy being consumed. Techniques include approaches

that reduce the "header tax" incurred by payloads as well as

methods resulting in the reduction of wasteful retransmissions.

Likewise, aspects such as restructuring addresses in ways that

allow to minimize the size of lookup tables and associated memory

sizes and their energy use can play a role as well. Another role

of protocols concerns the enabling of functionality to improve

energy efficiency at the network level, such as discovery

protocols that allow for quick adaptation to network components

being taken dynamically into and out of service depending on

network conditions.

o At the network level. Perhaps the greatest opportunities to

realize power savings exist at the level of the network as a whole.

For example, optimizing energy efficiency may involve directing

traffic in such a way that it allows for isolation of equipment

that may at the moment not be needed so that it could be powered

down or brought into power-saving mode. By the same token,

traffic should not be directed in a way that requires bringing

additional equipment online or out of power-saving mode in cases

where alternative traffic paths are available for which the

incremental energy cost would amount to zero. Likewise, some

networking devices may be more power-intensive than others whose

use might be avoided unless required to meet peak capacity

demands. Generally, incremental power consumption can be viewed

as a cost metric that networks should strive to minimize and

consider as part of routing and of network path optimization.

o At the architecture level. The current network architecture

supports a wide range of applications, but does not take into

account energy efficiency as one of its design parameters. One

can argue that the most energy efficient shift of the last two

decades has been the deployment of Content Delivery Network

overlays: while these were set up to reduce latency and minimize

bandwidth consumption, from a network perspective, retrieving the

content from a local cache is also much greener. What other

architectural shifts can produce energy consumption reduction?

We believe that network standardization organizations in general, and

IETF in particular, can make important contributions to each of these

vectors. In this document, we will therefore explore each of those

vectors in further detail and for each point out specific challenges

for IETF.

It should be noted that this document borrows heavily from material

from a prior paper, [GreenNet22]. This material has been both

expanded (for example, in terms of some of the opportunities) and

pruned (for example, in terms of background on prior scholarly work).

In addition, unlike the prior paper, this document focuses on and

attempts to articulate specific challenges as related to work that

could be championed by the IETF.

3. Contributors to Network Energy Consumption

When exploring possibilities to improve energy efficiency, it is

important to understand which aspects contribute to power consumption

the most and hence where the greatest potential for power savings

lies.

Power is ultimately drawn by devices. The power consumption of the

device can be divided into the consumption of the core device - the

backplane and CPU, if you will - as well as additional consumption

incurred per port and line card. Furthermore it is important to

understand the difference between power consumption when a resource

is idling versus when it is under load. This helps to understand the

incremental cost of additional transmission versus the initial cost

of transmission.

In typical networking devices, only roughly half of the energy

consumption is associated with the data plane [bolla2011energy]. An

idle base system typically consumes more than half of the power of

the same system running at full load [chabarek08], [cervero19]. This

means that a device’s power consumption does not increase linearly

with the volume of forwarded traffic but instead resembles a step

function. Generally, the cost of the first bit is very high, as it

requires powering up a device, port, etc. The cost of transmission

of additional bits (beyond the first) is many orders of magnitude

lower. Likewise, the cost of the incremental CPU and memory

needed to process additional packets becomes fairly negligible. By

the same token, generally speaking it is more energy-efficient to

transmit a large volume of data in one burst (and turn off the

interface when idling) than to transmit continuously at a

lower rate. In that sense it can be the duration of the transmission

that dominates the energy consumption, not the actual data rate.
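
The reasoning above can be illustrated with a minimal sketch. It
assumes a hypothetical two-state interface model (a fixed draw while
powered on, a small per-bit increment, and a low draw while asleep)
and ignores wake-up cost and any difference in idle power between
interface speeds. All names and numbers are illustrative assumptions,
not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfacePowerModel:
    """Hypothetical two-state power model for a single interface."""
    idle_watts: float         # drawn whenever the interface is powered on
    joules_per_bit: float     # incremental cost of each transmitted bit
    sleep_watts: float = 0.0  # drawn while the interface is powered down

    def energy_joules(self, bits: float, rate_bps: float, window_s: float) -> float:
        """Energy to deliver `bits` at `rate_bps` within `window_s` seconds.

        The interface is on only while transmitting and sleeps otherwise.
        Wake-up and initialization costs are ignored (an assumption).
        """
        active_s = bits / rate_bps
        if active_s > window_s:
            raise ValueError("rate too low to finish within the window")
        return (self.idle_watts * active_s
                + self.joules_per_bit * bits
                + self.sleep_watts * (window_s - active_s))


# Illustrative numbers only: 36 Gbit to deliver within one hour.
model = InterfacePowerModel(idle_watts=10.0, joules_per_bit=1e-9, sleep_watts=0.5)
bits = 36e9
window = 3600.0

continuous = model.energy_joules(bits, rate_bps=10e6, window_s=window)  # 10 Mbit/s
burst = model.energy_joules(bits, rate_bps=10e9, window_s=window)       # 10 Gbit/s
print(f"continuous: {continuous:.1f} J, burst: {burst:.1f} J")
```

Under these assumed numbers the continuous transfer costs roughly
36 kJ and the burst roughly 1.9 kJ; the difference comes almost
entirely from the time the interface spends powered on, not from the
number of bits sent.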

The implications on green networking from an energy-savings

standpoint are significant: Potentially the largest gains can be made

when network resources can effectively be taken off the grid (i.e.

isolated and removed from service so they can be powered down while

not needed). Likewise, for applications where this is possible, it

may be desirable to replace continuous traffic at low data rates with

traffic that is sent in bursts at high data rates, in order to

potentially maximize the time during which resources can be idled.

At the same time, any non-idle resources should be utilized to the

greatest extent possible as the incremental energy cost is

negligible. Of course, this needs to occur while still taking other

operational goals into consideration, such as protection against

failures (allowing for readily-available redundancy and spare

capacity in case of failure) and load balancing (for increased

operational robustness). As data transmission needs tend to

fluctuate wildly and occur in bursts, any optimization schemes need

to be highly adaptable and allow for very short control loops.

As a result, emphasis needs to be given to technology that, for

example, (at the device level) enables very efficient and rapid

discovery, monitoring, and control of networking resources so that

they can be dynamically taken offline or brought back into service,

without (at the network level) requiring extensive convergence of

state across the network or recalculation of routes and other

optimization problems, and that (at the network equipment level)

supports rapid power cycling and initialization schemes.

4. Challenges and Opportunities - Equipment Level

Perhaps the most obvious opportunities to make networking technology

more energy efficient exist at the equipment level. After all,

networking involves physical equipment to receive and transmit data.

Making such equipment more power efficient, having it dissipate less

heat to consume less energy and reduce the need for cooling, making

it eco-friendly to deploy, sourcing sustainable materials and

facilitating recycling of equipment at the end of its life-cycle all

contribute to making networks greener. More specific and unique to

networking are schemes to reduce energy usage of transmission

technology from wireless (antennas) to optical (lasers).

Beyond such "first-order" opportunities, network equipment plays an

equally important role in enabling and supporting green networking

at other levels. Of prime importance is the equipment’s

ability to provide visibility to management and control plane into

its current energy usage. Such visibility enables control loops for

energy optimization schemes, allowing applications to obtain feedback

regarding the energy implications of their actions, from setting up

paths across the network that require the least incremental amount of

energy to quantifying metrics related to energy cost used to optimize

forwarding decisions.

One prerequisite to such schemes is to have proper instrumentation in

place that allows to monitor current power consumption at the level

of networking devices as a whole, line cards, and individual ports.

Such instrumentation should also allow to assess the energy

efficiency and carbon footprint of the device as a whole. In

addition, it would be desirable to relate this power consumption to

data rates as well as to current traffic, for example, to indicate

current energy consumption relative to interface speeds, as well as

for incremental energy consumption that is expected for incremental

traffic (to aid control schemes that aim to "shave" power off current

services or to minimize the incremental use of power for additional

traffic). This is an area where the current state of the art is

sorely lacking and standardization lags behind; for example, as of

today, no corresponding standardized YANG data models [RFC7950] for

network energy consumption that can be used in conjunction with

management and control protocols have been defined.
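
As a rough illustration of the granularity described above, the
following sketch models power readings per port, per line card, and
per device, and relates them to interface speed and current traffic.
It is not a YANG model and does not reflect any existing standard;
every class and field name is a hypothetical choice made for this
example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PortReading:
    name: str
    power_watts: float   # measured draw attributed to this port
    speed_bps: float     # configured interface speed
    traffic_bps: float   # traffic currently forwarded

    def watts_per_gbps_capacity(self) -> float:
        """Power relative to the interface speed."""
        return self.power_watts / (self.speed_bps / 1e9)

    def joules_per_bit(self) -> float:
        """Energy per forwarded bit; infinite when the port carries no traffic."""
        if self.traffic_bps == 0:
            return float("inf")
        return self.power_watts / self.traffic_bps


@dataclass
class LineCardReading:
    name: str
    base_power_watts: float  # card overhead excluding its ports
    ports: List[PortReading] = field(default_factory=list)

    def power_watts(self) -> float:
        return self.base_power_watts + sum(p.power_watts for p in self.ports)


@dataclass
class DeviceReading:
    name: str
    chassis_power_watts: float  # backplane, CPU, and other shared components
    line_cards: List[LineCardReading] = field(default_factory=list)

    def power_watts(self) -> float:
        return self.chassis_power_watts + sum(c.power_watts() for c in self.line_cards)

    def traffic_bps(self) -> float:
        return sum(p.traffic_bps for c in self.line_cards for p in c.ports)

    def joules_per_bit(self) -> float:
        traffic = self.traffic_bps()
        return float("inf") if traffic == 0 else self.power_watts() / traffic


# Illustrative values only.
busy = PortReading("port-1", power_watts=5.0, speed_bps=10e9, traffic_bps=4e9)
idle = PortReading("port-2", power_watts=5.0, speed_bps=10e9, traffic_bps=0.0)
card = LineCardReading("card-1", base_power_watts=50.0, ports=[busy, idle])
device = DeviceReading("router-1", chassis_power_watts=200.0, line_cards=[card])
print(device.power_watts(), device.joules_per_bit())
```

A controller could compare the per-port and per-device figures to see
how much of the energy per bit is attributable to shared overhead
rather than to the port actually carrying the traffic.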

Instrumentation should also take into account the possibility of

virtualization, introducing layers of indirection to assess the

actual energy usage. For example, virtualized networking functions

could be hosted on containers or virtual machines which are hosted on

a CPU in a data center instead of a regular network appliance such as

a router or a switch, leading to very different power consumption

characteristics. For example, a data center CPU could be more power

efficient and consume power more proportionally to actual CPU load.

Instrumentation needs to reflect these facts and facilitate

attributing power consumption in a correct manner.

Beyond monitoring and providing visibility into power consumption,

control knobs are needed to configure energy saving policies. For

instance, power saving modes are common in endpoints (such as mobile

phones or notebook computers) but sorely lacking in networking

equipment.

5. Challenges and Opportunities - Protocol Level

There are several opportunities for energy savings at the protocol

level. We characterize them along three main categories: protocols

designed to reduce the volume of data to be transmitted; protocols

designed to optimize data transmission rates under energy

considerations; and protocols that enable energy optimization schemes

at the network level. A fourth category, "other", is used to capture

any other aspects not easily categorized into the other three.

5.1. Data Volume Reduction

The first category involves designing protocols in such a way that

they reduce the volume of data that needs to be transmitted for any

given purpose. Loosely speaking, by reducing this volume, more

traffic can be served by the same amount of networking

infrastructure, hence reducing overall energy consumption.

Possibilities here include protocols that avoid unnecessary

retransmissions. At the application layer, protocols may also use

coding mechanisms that encode information close to the Shannon limit.

Currently, most of the traffic over the Internet consists of video

streaming and encoders for video are already quite efficient and keep

improving all the time, resulting in energy savings as one of many

advantages (of course being offset by increasingly higher

resolution). However, it is not clear that the extra work to achieve

higher compression ratios for the payloads results in a net energy

gain: what is saved over the network may be offset by the

compression/decompression effort. Further research on this aspect is

necessary.
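
The trade-off can be stated as a simple balance, sketched below. The
per-byte network cost and the codec cost are assumed inputs that
would have to come from measurement; the figures used here are
placeholders chosen only to show that the sign of the result can go
either way.

```python
def net_saving_joules(original_bytes: int, compressed_bytes: int,
                      network_joules_per_byte: float, codec_joules: float) -> float:
    """Network energy avoided by compression minus the energy spent on the codec."""
    avoided = (original_bytes - compressed_bytes) * network_joules_per_byte
    return avoided - codec_joules


def break_even_joules_per_byte(original_bytes: int, compressed_bytes: int,
                               codec_joules: float) -> float:
    """Network cost per byte above which compression starts to pay off."""
    saved_bytes = original_bytes - compressed_bytes
    if saved_bytes <= 0:
        raise ValueError("compression did not reduce the payload size")
    return codec_joules / saved_bytes


# Placeholder figures: 1 MB compressed to 400 kB, 1 microjoule per byte end to end.
cheap_codec = net_saving_joules(1_000_000, 400_000, 1e-6, codec_joules=0.2)
costly_codec = net_saving_joules(1_000_000, 400_000, 1e-6, codec_joules=1.0)
threshold = break_even_joules_per_byte(1_000_000, 400_000, codec_joules=0.3)
print(cheap_codec, costly_codec, threshold)
```

With the cheaper codec the result is a net saving; with the costlier
one the same reduction in bytes is a net loss.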

At the transport protocol layer, TCP and to some extent QUIC detect

congestion through packet drops. This is a highly energy inefficient

method to signal congestion, since the network has to wait one RTT to

be aware that the congestion has occurred, and since the effort to

transmit the packet from the source up until it is dropped ends up

being wasted. This calls for new transport protocols that react to

congestion without dropping packets. ECN [RFC2481] is a possible

solution, but it is not widely deployed. DC-TCP [alizadeh2010DCTCP] is

tuned for the Data Center. Qualitative Communication [QUAL]

[westphal2021qualitative] allows the nodes to react to congestion by

dropping only some of the data in the packet, thereby only partially

wasting the resources consumed by transmitting the packet up to this

point. We believe there is a need for novel transport protocols for

the WAN that ensure that no energy is wasted transmitting packets

that will be eventually dropped.

Another solution is to reduce the bandwidth used by network protocols

by reducing their header tax, for example through header compression.

An example in IETF is [RFC3095]. Again, reducing protocol header

size saves energy to forward packets, but at the cost of maintaining

a state for compression/decompression, plus computing these

operations. The gain from such protocol optimization further depends

on the application and whether it sends packets with large payloads

close to the MTU (the header tax and any savings here are very

limited), or whether it sends packets with very small payload size

(making the header tax more pronounced and savings more significant).
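
The dependence on payload size can be made concrete with a short
calculation. The sketch assumes a 40-byte IPv4/UDP/RTP header stack
(20 + 8 + 12 bytes, without options) and a compressed header of 4
bytes; the compressed size is an illustrative assumption rather than
a figure taken from [RFC3095], and the payload sizes are examples.

```python
def header_tax(payload_bytes: int, header_bytes: int) -> float:
    """Fraction of transmitted bytes that is header rather than payload."""
    return header_bytes / (payload_bytes + header_bytes)


def wire_savings(payload_bytes: int, header_before: int, header_after: int) -> float:
    """Relative reduction in bytes on the wire when the header shrinks."""
    before = payload_bytes + header_before
    after = payload_bytes + header_after
    return 1.0 - after / before


UNCOMPRESSED = 40  # IPv4 (20) + UDP (8) + RTP (12), no options
COMPRESSED = 4     # assumed compressed header size, for illustration only

for payload in (20, 1400):  # small voice-like frame vs. near-MTU payload
    print(payload,
          round(header_tax(payload, UNCOMPRESSED), 3),
          round(wire_savings(payload, UNCOMPRESSED, COMPRESSED), 3))
```

Under these assumptions the small payload sees two thirds of its
bytes spent on headers and a 60% reduction on the wire, while the
near-MTU payload sees under 3% header tax and a 2.5% reduction.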

An alternative to reducing the amount of protocol data is to design

routing protocols that are more efficient to process at each node.

For instance, path based forwarding/labels such as MPLS [RFC3031]

facilitate the next hop look-up, thereby reducing the energy

consumption. It is unclear whether keeping some state at a router to

speed up lookups is more energy efficient than a stateless lookup that

is more computationally intensive. Other methods to speed up a next-hop

lookup include geographic routing (e.g. [herzen2011PIE]). Some

network protocols could be designed to reduce the next hop look-up

computation at a router. It is unclear if Longest Prefix Match (LPM)

is inefficient from an energy point of view, or if it is a

significant energy budget cost for the operation of a router.

5.2. Traffic Adaptation

The second category involves designing protocols in such a way that

the rate of transmission is chosen to maximize energy efficiency.

For example, Traffic Engineering (TE) can be manipulated to impact

the rate adaptation mechanism [ren2018jordan]. By choosing where to

send the traffic, TE can artificially congest links so as to trigger

rate adaptation and therefore reduce the total amount of traffic.

Most TE systems attempt to minimize Maximal Link Utilization (MLU)

but energy saving mechanisms could decide to do the opposite

(maximize minimal link utilization) and attempt to turn off some

resources to save power.
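
The contrast between the two objectives can be sketched for the
simplest possible case: a set of demands placed on parallel links of
equal capacity between the same two nodes. The scenario, the function
names, and the greedy heuristics are assumptions made for
illustration; real TE systems solve a far more general problem.

```python
from typing import List


def spread(demands: List[float], n_links: int) -> List[float]:
    """Greedy load balancing: place each demand on the least-loaded link."""
    loads = [0.0] * n_links
    for d in sorted(demands, reverse=True):
        i = loads.index(min(loads))
        loads[i] += d
    return loads


def pack(demands: List[float], n_links: int, capacity: float) -> List[float]:
    """First-fit decreasing: fill links in order so that unused ones can sleep."""
    loads = [0.0] * n_links
    for d in sorted(demands, reverse=True):
        for i in range(n_links):
            if loads[i] + d <= capacity:
                loads[i] += d
                break
        else:
            raise ValueError("demand does not fit on any link")
    return loads


def summarize(loads: List[float], capacity: float) -> tuple:
    """Return (maximal link utilization, number of links carrying traffic)."""
    return max(loads) / capacity, sum(1 for load in loads if load > 0)


demands = [3.0, 2.0, 2.0, 1.0]  # Gbit/s, illustrative
balanced = spread(demands, n_links=4)
consolidated = pack(demands, n_links=4, capacity=10.0)
print(summarize(balanced, 10.0), summarize(consolidated, 10.0))
```

In this example balancing yields a maximal utilization of 0.3 with
all four links active, while consolidation yields 0.8 with a single
active link, leaving three links as candidates for power-saving mode
at the expense of headroom and redundancy.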

5.3. Enabling Network Energy Saving Mechanisms

Novel protocols are also needed in two dimensions: to discover what

links are available and/or energy efficient. For instance, links may

be turned off in order to save energy, and turned back on based upon

the elasticity of the demand. Protocols should be devised to

discover when this happens, and to have a view of the topology that

is consistent with frequent topology updates due to power cycling of

the network resources.

Also, protocols are required to quickly converge onto an energy-

efficient path once a new topology is created by turning links on/

off. Current routing protocols may provide for fast recovery in the

case of failure. However, failures are hopefully relatively rare

events, while we expect an energy efficient network to aggressively

try to turn off links.

Some mechanism is needed to present to the management layer a view of

the network that identifies opportunities to turn resources off

(routers/links) while still providing some decent level of Quality of

Experience (QoE) to the users. This gets more complex as the level

of QoE shifts from the current Best Effort delivery model to more

sophisticated mechanisms with, for instance, latency, bandwidth or

reliability guarantees.

5.4. Network Addressing

There are other ways to shave off energy usage from networks. One

example concerns network addressing. Address tables can get very

large, resulting in large forwarding tables that require considerable

amount of memory, in addition to large amounts of state needing to be

maintained and synchronized. From an energy footprint perspective,

both can be considered wasteful and offer opportunities for

improvement. At the protocol level, rethinking how addresses are

structured can allow for flexible addressing schemes that can be

exploited in network deployments that are less energy-intensive by

design. This can be complemented by supporting clever address

allocation schemes that minimize the number of required forwarding

entries as part of deployments.
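
The effect of address allocation on table size can be shown with
Python's standard ipaddress module. The sketch assumes that all
listed prefixes share the same next hop, so that they are candidates
for aggregation; the prefixes themselves are arbitrary examples.

```python
import ipaddress
from typing import Iterable, List


def aggregate(prefixes: Iterable[str]) -> List[ipaddress.IPv4Network]:
    """Collapse IPv4 prefixes sharing a next hop into the fewest covering entries."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return list(ipaddress.collapse_addresses(networks))


# Four customers allocated contiguously versus scattered across the block.
contiguous = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
scattered = ["10.0.0.0/24", "10.0.2.0/24", "10.0.5.0/24", "10.0.9.0/24"]

print(len(aggregate(contiguous)), aggregate(contiguous))
print(len(aggregate(scattered)))
```

The contiguous allocation collapses into a single /22 entry, whereas
the scattered allocation still needs four entries for the same number
of customers.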

6. Challenges and Opportunities - Network Level

Networks have been optimized for many years under many criteria, for

example to optimize (maximize) network utilization and to optimize

(minimize) cost. Hence, it is straightforward to add optimization for

"greenness" (including energy efficiency, power consumption, carbon

footprint) as important criteria.

This includes assessing the carbon footprints of paths and optimizing

those paths so that overall footprint is minimized, then applying

techniques such as path-aware networking or segment routing [RFC8402]

to steer traffic along those paths. It also includes aspects such as

considering the incremental energy usage in routing decisions.

Optimizing cost has a long tradition in networking; many of the

existing mechanisms can be leveraged for greener networking simply by

introducing energy footprint as a cost factor. Low-hanging fruit

include the inclusion of energy-related parameters as a cost

parameter in control planes, whether distributed (e.g. IGP) or

conceptually centralized via SDN controllers.
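
A minimal sketch of this idea follows: a standard shortest-path
computation (Dijkstra's algorithm) in which the additive link cost is
the incremental power needed to carry a new flow. The topology, the
cost values, and the assumption that incremental power is known per
link are all hypothetical; a sleeping link is modeled simply as one
with a high incremental cost.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> neighbor -> incremental watts


def least_energy_path(graph: Graph, src: str, dst: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm with incremental power as the additive link cost."""
    best: Dict[str, float] = {src: 0.0}
    prev: Dict[str, str] = {}
    queue: List[Tuple[float, str]] = [(0.0, src)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, watts in graph.get(node, {}).items():
            new_cost = cost + watts
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(queue, (new_cost, nbr))
    if dst not in best:
        raise ValueError(f"no path from {src} to {dst}")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return best[dst], path[::-1]


# The direct link A-B is asleep (30 W to wake); A-C and C-B are already active.
topology: Graph = {
    "A": {"B": 30.0, "C": 0.5},
    "C": {"B": 0.5},
    "B": {},
}
cost, path = least_energy_path(topology, "A", "B")
print(cost, path)
```

Here the two-hop path over already active links is preferred to the
shorter path that would require waking a link, which is the kind of
decision a hop-count or bandwidth-based metric would not make.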

Other opportunities concern adding energy-awareness to dynamic path

selection schemes, requiring corresponding instrumentation as

mentioned earlier. Again, considerable energy savings can

potentially be realized by taking resources offline (e.g. putting

them into power-saving or hibernation mode) when they are not

currently needed under current network demand and load conditions.

Therefore, weaning such resources from traffic becomes an important

consideration for energy-efficient traffic steering. This contrasts

and indeed conflicts with existing schemes that typically aim to

create redundancy and load-balance traffic across a network to

achieve even resource utilization. This usually occurs for important

reasons, such as making networks more resilient, optimizing service

levels, and increasing fairness. One of the big challenges hence

concerns how resource weaning schemes to realize energy savings can

be accommodated while preventing the cannibalization of other

important goals, counteracting other established mechanisms, and

avoiding destabilization of the network.

As an important prerequisite to capture many of those opportunities,

good abstractions (and corresponding instrumentation) that allow to

easily assess energy cost and carbon footprint will be required.

These abstractions need to account not only for the energy cost

associated with packet forwarding across a given path, but also for

the related cost of processing, memory, and maintaining state, to

result in a holistic picture. Optimization of carbon footprint in

many cases involves trade-offs among not only packet forwarding but

also aspects such as keeping state, caching data, or running

computations at the edge instead of elsewhere. (Note: there may be a

differential in running a computation at an edge server vs. at a

hyperscale DC. The latter is often better optimized than the

former.) Likewise, other aspects of carbon footprint beyond mere

energy-intensity should be considered. For instance, some network

segments may be powered by more sustainable energy sources than

others, and some network equipment may be more environmentally-

friendly to build, deploy and recycle, all of which can be reflected

in abstractions to consider.

A related set of challenges concerns the fact that such schemes

result in much greater dynamicity and continuous change in the

network as resources may be getting steered away from (when possible)

and then leveraged again (when necessary) in rapid succession. This

imposes significant stress on convergence schemes that results in

challenges to the scalability of solutions and their ability to

perform in a fast-enough manner. Network-wide convergence imposes

high cost and incurs significant delay and is hence not well suited

to such schemes. What will in all likelihood be needed are

mechanisms that do not require convergence beyond the vicinity of the

affected network device. Especially in cases where central network

controllers are involved that are responsible for aspects such as

configuration of paths and the positioning of network functions and

that aim for global optimization, the impact of churn needs to be

minimized. This means that, for example, extensive recalculation

e.g. of routes and paths based on the current energy state of the

network needs to be avoided.

An opportunity may lie in making a distinction between "energy modes"

of different domains. For instance, in a highly trafficked core, the

energy challenge is to transmit the traffic efficiently. The amount

of traffic is relatively fluid (due to multiplexing of multiple

sessions) and the traffic is predictable. In this case, there is no

need to optimize on a per session basis nor even at a short time

scale. In the access networks connecting to that core, though, there

are opportunities for this fast convergence: traffic is much more

bursty, less predictable and the network should be able to be more

reactive. Other domains such as DCs may also have more variable

workloads and different traffic patterns.

7. Challenges and Opportunities - Architecture Level

Another possibility to improve network energy efficiency is to

organize networks in a way that they can best serve important

applications so as to minimize energy consumption. Examples include

retrieval of content or remote computation. This allows to minimize

the amount of communication that needs to take place in the first

place, although energy savings within the network may at least in

part be offset by additional energy consumption elsewhere. The

following are some examples that suggest that it may be worthwhile

reconsidering the ways in which networks are architected to minimize

their carbon footprint.

For example, Content Delivery Networks (CDNs) have reduced the energy

expenditure of the Internet by caching content near the users.

The content is sent only a few times over the WAN, and then is served

locally. This shifts the energy consumption from networking to

storage. Further methods can reduce the energy usage even more

[bianco2016energy][mathew2011energy][islam2012evaluating]. Whether

overall energy savings are net positive depends on the actual

deployment, but from the network operator’s perspective, at least it

shifts the energy bill away from the network to the CDN operator.

While CDNs operate as an overlay, another architecture has been

proposed to provide the CDN features directly in the network, namely

Information Centric Networks [ahlgren2012survey], studied as well in

the IRTF ICNRG. This however shifts the energy consumption back to

the network operator and requires some power-hungry hardware, such as

chips for larger name look-ups and memory for the in-network cache.

As a result, it is unclear if there is an actual energy gain from the

dissemination and retrieval of content within in-network caches.

Fog computing and placing intelligence at the edge are other

architectural directions for reducing the amount of energy that is

spent on packet forwarding and in the network. There again, the

trade-off is between performing computation in an energy-optimized

data center at very large scale, but requiring transmission of

significant volumes of data across many nodes and long distances,

versus performing computational tasks at the edge where the energy

may not be used as efficiently (less multiplexing of resources, and

smaller sites are inherently less efficient due to their smaller

scale) but the amount of long-distance network traffic is

significantly reduced. Softwarization, containers, microservices are

direct enablers for such architectures, and the deployment of

programmable network infrastructure (as for instance Infrastructure

Processing Units - IPUs or smartNICs that offload some computations

from the CPU onto the NIC) will help its realization. However, the

power consumption characteristics of CPUs are different from those of

NPUs, another aspect to be considered in conjunction with

virtualization.

Other possibilities concern taking economic aspects into

consideration, such as providing incentives to users of

networking services in order to minimize energy consumption and

emission impact. An example for this is given in

[wolf2014choicenet], which could be expanded to include energy

incentives.

Other approaches consider performing a late binding of data and

functions to be performed on the data [krol2017NFaaS]. The COIN

Research Group in IRTF focuses on similar issues. Jointly optimizing

for the total energy cost, taking into account networking and

computing (and the different energy cost of computing in a

hyperscale DC vs an edge node) is still an area of open research.

In summary, rethinking of the overall network (and networked

application) architecture can be an opportunity to significantly

reduce the energy cost at the network layer, for example by

performing tasks that involve massive communications closer to the

user. To what extent these shifts result in a net reduction of

carbon footprint is an important question that requires further

analysis on a case-by-case basis.

8. Conclusions

How to make networks "greener" and reduce their carbon footprint is

an important problem for the networking industry to address, both for

societal and for economic reasons. This document has highlighted

some of the technical challenges and opportunities in that regard,

for example:

o Equipment instrumentation advances for improved energy-awareness,

definition and standardization of granular management information;

o Protocol advances for improving the ratio of goodput to throughput

and to reduce waste: reduction in header tax, in protocol

verbosity, improvements in coding, etc.

o Protocol advances to enable rapidly taking down, bringing back

online, and discovering the availability and power saving status of

networking resources while minimizing the need for reconvergence

and propagation of state;

o Network advances to allow to dynamically take resources offline

where feasible while minimizing churn;

o Energy footprint aware traffic steering and routing; carbon

footprint as a traffic cost metric to optimize;

o Reorganization of networking architecture for important classes of

applications (examples: content delivery, right-placing of

computational intelligence) to optimize green footprint and

holistic approaches to trade off carbon footprint between

forwarding, storage, and computation;

o Security issues imposed by greater energy awareness, to minimize

the new attack surfaces that would allow an adversary to turn off

resources, or to waste energy;

o Reliability issues for a network that relies on less resource

diversity and has greater operational complexity.

Of those, perhaps the key challenge to address right away concerns

the ability to expose at a fine granularity the energy impact of any

networking actions. Providing visibility into this will enable many

approaches to come towards a solution. It will be key to

implementing optimization via control loops that allow to assess the

energy impact of decisions taken. It will also help to answer

questions such as: is caching - with the associated storage energy -

better than retransmitting from a different server - with the

associated networking cost? Is compression more energy-efficient

once factoring the computation cost of compression vs transmitting

uncompressed data? Which compression scheme is more energy

efficient? Is energy saving of computing at an efficient hyperscale

DC compensated by the networking cost to reach that DC? Is the

overhead of gathering and transmitting fine-grained energy telemetry

data offset by the total energy gain by ways of better decisions that

this data enables? Is transmitting data to a LEO constellation

compensated by the fact that once in the constellation, the

networking is fueled on solar energy? Is the energy cost of sending

rockets to place routers in Low Earth Orbit amortized over time?

Determining where the sweet spots are and optimizing networks along

those lines will be key to making networks "greener". We

expect to see significant advances across these areas and believe

that IETF has an important role to play in facilitating this.

9. IANA Considerations

This document does not have any IANA requests.

10. Security Considerations

Security considerations may appear to be orthogonal to green

networking considerations. However, there are a number of important

caveats.

Security vulnerabilities of networks may manifest themselves in

compromised energy efficiency. For example, attackers could aim at

increasing energy consumption in order to drive up attack victims’

energy bill. Specific vulnerabilities will depend on the particular

mechanisms. For example, in the case of monitoring energy

consumption data, tampering with such data might result in

compromised energy optimization control loops. Hence any mechanisms

to instrument and monitor the network for such data need to be

properly secured to ensure authenticity.
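
One generic way to protect the authenticity of an individual energy
reading is a keyed message authentication code, sketched below with
Python's standard hmac module. This is an illustration of the
principle only: the reading format, the field names, and the
pre-shared key are assumptions, and an actual deployment would more
likely rely on the security of the management transport in use.

```python
import hashlib
import hmac
import json


def sign_reading(reading: dict, key: bytes) -> str:
    """Return an HMAC-SHA256 tag over a canonical JSON encoding of the reading."""
    payload = json.dumps(reading, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_reading(reading: dict, tag: str, key: bytes) -> bool:
    """Check the tag in constant time; False means the reading was altered."""
    return hmac.compare_digest(sign_reading(reading, key), tag)


key = b"example-shared-key"  # placeholder; key management is out of scope here
reading = {"device": "router-1", "port": "port-1", "power_watts": 5.0}
tag = sign_reading(reading, key)

tampered = dict(reading, power_watts=0.5)  # attacker under-reports consumption
print(verify_reading(reading, tag, key), verify_reading(tampered, tag, key))
```

A control loop that verifies the tag before acting would reject the
tampered reading instead of optimizing against falsified data.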

In some cases there are inherent tradeoffs between security and

maximal energy efficiency that might otherwise be achieved. An

example is encryption, which requires additional computation for

encryption and decryption activities and security handshakes, in

addition to the need to send more traffic than necessitated by the

entropy of the actual data stream. Likewise, mechanisms that allow

resources to be turned on or off could become a target for attackers.

11. Acknowledgments

Acknowledgments will be added at a later stage.

12. Informative References (TBD)

[ahlgren2012survey]

Ahlgren, B., Dannewitz, C., Imbrenda, C., Kutscher, D.,

and B. Ohlman, "A survey of information-centric

networking", IEEE Communications Magazine Vol.50 No.7,

[alizadeh2010DCTCP]

Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel,

P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data

Center TCP (DCTCP)", ACM SIGCOMM pp.63-74, 2010.

[bianco2016energy]

Bianco, A., Mashayekhi, R., and M. Meo, "Energy

consumption for data distribution in content delivery

networks", IEEE International Conference on Communications

(ICC) pp.1-6, 2016.

[bolla2011energy]

Bolla, R., Bruschi, R., Davoli, F., and F. Cucchietti,

"Energy Efficiency in the Future Internet: A Survey of

Existing Approaches and Trends in Energy-Aware Fixed

Network Infrastructures", IEEE Communications Surveys and

Tutorials Vol.13 No.2, pp.223-244, 2011.

[cervero19]

Cervero, A. G., Chincoli, M., Dittmann, L., Fischer, A.,

and A. Garcia, "Green Wired Networks", Wiley Journal on

Large-Scale Distributed Systems and Energy

Efficiency pp.41-80, 2019.

[chabarek08]

Chabarek, J., Sommers, J., Barford, P., Tsiang, D., and S.

Wright, "Power awareness in network design and routing",

IEEE Infocom pp.457-465, 2008.

[GreenNet22]

Clemm, A. and C. Westphal, "Challenges and Opportunities

in Green Networking", 1st International Workshop on

Network Energy Efficiency in the Softwarization Era IEEE

NetSoft 2022, June 2022.

[herzen2011PIE]

Herzen, J., Westphal, C., and P. Thiran, "Scalable routing

easy as PIE: A practical isometric embedding protocol",

19th IEEE International Conference on Network Protocols

(ICNP) pp.49-58, 2011.

[islam2012evaluating]

Islam, S. U. and J. Pierson, "Evaluating Energy

Consumption in CDN Servers", Proceedings of the Second

International Conference on ICT as Key Technology against

Global Warming pp.64-78, 2012.

[krol2017NFaaS]

Krol, M. and I. Psaras, "NFaaS: Named Function as a

Service", ACM SIGCOMM ICN Conference, 2017.

[mathew2011energy]

Mathew, V., Sitaraman, R., and P. Shenoy, "Energy-Aware

Load Balancing in Content Delivery Networks", CoRR

http://arxiv.org/abs/1109.5641 , 2011.

[QUAL] Li, R., Makhijani, K., Yousefi, H., Westphal, C., Xong,

L., Wauters, T., and F. D. Turck, "A framework for

Qualitative Communications using Big Packet Protocol",

Proceedings ACM Sigcomm Workshop On Networking For

Emerging Applications And Technologies pp.22-28, 2019.

[ren2018jordan]

Ren, J., Ren, K., Westphal, C., Wang, J., Wang, J., Song,

T., Liu, S., and J. Wang, "JORDAN: A Novel Traffic

Engineering Algorithm for Dynamic Adaptive Streaming over

HTTP", IEEE International Conference on Computing,

Networking and Communications (ICNC) pp.581-587, 2018.

[RFC2481] Ramakrishnan, K. and S. Floyd, "A Proposal to add Explicit

Congestion Notification (ECN) to IP", RFC 2481,

DOI 10.17487/RFC2481, January 1999,

<https://www.rfc-editor.org/info/rfc2481>.

[RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol

Label Switching Architecture", RFC 3031,

DOI 10.17487/RFC3031, January 2001,

[RFC3095] Bormann, C., Burmeister, C., Degermark, M., Fukushima, H.,

Hannu, H., Jonsson, L-E., Hakenberg, R., Koren, T., Le,

K., Liu, Z., Martensson, A., Miyazaki, A., Svanbro, K.,

Wiebke, T., Yoshimura, T., and H. Zheng, "RObust Header

Compression (ROHC): Framework and four profiles: RTP, UDP,

ESP, and uncompressed", RFC 3095, DOI 10.17487/RFC3095,

July 2001, <https://www.rfc-editor.org/info/rfc3095>.

[RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",

RFC 7950, DOI 10.17487/RFC7950, August 2016,

[RFC8402] Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,

Decraene, B., Litkowski, S., and R. Shakir, "Segment

Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,

[telefonica2020]

Telefonica, "Consolidated Management Report 2020", 2021.

[westphal2021qualitative]

Westphal, C., He, D., Makhijani, K., and R. Li,

"Qualitative Communications for Augmented Reality and

Virtual Reality", 22nd IEEE International Conference on

High Performance Switching and Routing (HPSR) pp.1-6,

[wolf2014choicenet]

Wolf, T., Griffioen, J., Calvert, K., Dutta, R.,

Rouskas, G., Baldin, I., and A. Nagurney, "ChoiceNet:

Toward an Economy Plane for the Internet", SIGCOMM

Computer Communications Review Vol.44 No.3, July 2014.

Alexander Clemm

Futurewei

Santa Clara, CA 95050

Email: ludwig@clemm.org

Cedric Westphal

Futurewei

Email: cedric.westphal@futurewei.com

Jeff Tantsura

Microsoft

Laurent Ciavaglia

Rakuten Mobile

Email: laurent.ciavaglia@rakuten.com

Marie-Paule Odini

Email: mp.odini@orange.fr

Network Working Group T. Eckert, Ed.

Internet-Draft Futurewei Technologies USA

Intended status: Informational M. Boucadair

Expires: 12 January 2023 Orange

P. Thubert

Cisco Systems, Inc.

J. Tantsura

Microsoft

11 July 2022

IETF and Energy - An Overview

draft-eckert-ietf-and-energy-overview-03

Abstract

This memo provides an overview of work performed by or proposed

within the IETF related to energy and/or green: awareness,

management, control or reduction of consumption of energy, and

sustainability as it relates to the IETF.

This document is written to help those unfamiliar with the work but

interested in it, in the hope of raising more interest in energy-

related activities within the IETF, such as identifying gaps and

investigating solutions as appropriate.

Status of This Memo

This Internet-Draft will expire on 12 January 2023.

Copyright Notice

Eckert, et al. Expires 12 January 2023 [Page 1]

Internet-Draft energy-overview July 2022

Provisions Relating to IETF Documents (https://trustee.ietf.org/

license-info) in effect on the date of publication of this document.

Please review these documents carefully, as they describe your rights

and restrictions with respect to this document.

Table of Contents

1. Introduction

2. Energy Saving: An Introduction

2.1. Digitization

2.2. Energy Saving Through Scale

2.2.1. An Example: Telephony

2.2.2. The Packet Multiplexing Principle

2.2.3. End-to-End Transport

2.2.4. Global vs Restricted Connectivity: The Internet Routing Architectures

2.2.5. Freedom to Innovate

2.2.6. End-to-End Encryption

2.2.7. Converged Networks

2.2.7.1. IntServ and DetNet

2.2.7.2. DiffServ

2.2.7.3. SIP

3. Higher or New Energy Consumption

4. Some Notes on Sustainability

4.1. Follow the Energy Cloud Scheduling

4.2. Minimize Generated Heat

4.3. Heat Recovery

4.4. Telecollaboration

5. Energy Optimization in Specific Networks

5.1. Low Power and Lossy Networks (LLN)

5.1.1. 6LOWPAN WG

5.1.2. LPWAN WG

5.1.3. 6TISCH WG

5.1.4. 6LO WG

5.1.5. ROLL WG

5.2. Constrained Nodes and Networks

5.2.1. LWIG WG

5.2.2. CoRE and CoAP

5.2.3. Satellite Constellations

5.2.4. Devices with Batteries

5.3. Sample Technical Enablers

5.3.1. (IP) Multicast

5.3.1.1. Power Saving with Multicast

5.3.1.2. Power Waste Through Multicast-based Service Coordination

5.3.1.3. Multicast Problems in Wireless Networks

5.3.2. Sleepy Nodes

5.4. (Lack of) Power Benchmarking Proposals

6. Energy Management Networks

6.1. Smart Grid

6.2. Syncro Phasor Networks

7. (Limited) Energy Management for Networks

7.1. Some Metrics

7.2. EMAN WG

8. Power-awareness in Forwarding and Routing Protocols

8.1. Power Aware Networks (PANET)

8.2. SDN-based Semantic Forwarding

8.3. Misc

9. Gaps

10. Summary

11. Changelog

12. Informative References

1. Introduction

This document summarizes work that has been proposed to or performed

within the IETF/IRTF. Particularly, it covers IETF/IRTF RFCs as well

as ISE RFCs and IETF/IRTF or individual submission drafts that were

abandoned for various reasons (e.g., lack of momentum, broad scope).

A given work can relate to energy in various ways, which are

classified into categories here. This classification does not

attempt to propose a formal taxonomy but is used for the sake of

better readability. Technologies are listed under the category that

is most specifically significant, for example, by being most narrow.

This memo usually refers to the technologies by significant early RFC

or specific draft version, as opposed to the newest. This is

contrary to the common practice in IETF documents to refer to the

newest version. This is done because it allows readers to better

understand the historic timeline in which a specific technology was

introduced. Especially successful IETF technologies will have newer

RFCs that update such initial work.

2. Energy Saving: An Introduction

Technologies that simply save energy compared to earlier/other

alternatives are the broadest and most unspecific category. In this

memo, energy saving simply refers to a reduction in energy measured in

some unit of electricity, such as kWh, and does not take other aspects

into account. See Section 4 for more details.

2.1. Digitization

Digitization describes the transformation of processes from forms

with little or no digital networking to forms that rely more heavily

on computer networking. For comparable process results, the digitized

option is often, but not always, less energy consuming. Consider for

example energy consumption in the evolution of messaging, starting

from postal mail, moving over telegrams and various other historic

forms, to solutions including e-mail utilizing for example the IETF

"Simple Mail Transfer Protocol" (SMTP, [RFC822]), group communications

utilizing the IETF "Network News Transfer Protocol" (NNTP, [RFC3977])

or the almost infinite set of communication options built on top of

the IETF "Hypertext Transfer Protocol" (HTTP, [RFC2086] and

successors) and IETF "Hypertext Markup Language" (HTML, [RFC1866] and

successors).

Traditionally, digitization had only an "incidental", not an

"intentional", relationship to energy consumption: if it saved energy,

this was not a target benefit and, until recently, was not even

recognized as one. Instead, the evolution was driven by benefits

other than energy, namely utility benefits such as improved speed,

functionality/flexibility, accessibility, scalability and reduced

cost.

In hindsight though, digitization through IETF technologies and

specifically the Internet will likely have made the largest

contribution to energy saving amongst all the categories that relate

IETF work to energy, but it is also the hardest to pinpoint on any

specific technology/RFC. Instead, it is often the combination of the

whole stack of deployed protocols and operational practices that

contributes to energy saving through digitization. The Internet as

well as all other TCP/IP networks are likely the biggest energy

saving development of the past few decades if only the energy

consumption of equivalent services is compared. On the other hand,

they are also the cause of the biggest new type of energy consumption

because of all the new services introduced in the past decades with

the Internet and the hyper-scaling that the Internet affords them.

2.2. Energy Saving Through Scale

2.2.1. An Example: Telephony

In most cases, energy saving through the use of IETF protocols

compared to earlier (digitized or non-digitized) solutions is purely

a result of the reduction in the energy cost per bit over the decades

in networking. For example, the energy consumption of digital voice

telephony through the IETF "Session Initiation Protocol" (SIP,

[RFC2543] and successors) can easily be assumed to be more energy

efficient on a per voice-minute basis than prior voice technologies

such as analog or digital "Time Division Multiplex" (TDM) telephony

solely because of this evolution in mostly device as well as

physical-layer and link-layer networking technologies.

2.2.2. The Packet Multiplexing Principle

Nevertheless, it is at the heart of the packet multiplexing model

employed by the IETF networking protocols IP ([RFC791]) and IPv6

([RFC1883] and successors) to successfully support this scaling that

brought down the cost per bit through ever faster links and network

nodes, especially for networks larger than building scale networks.

While the IETF protocols have not been the first or, over their early

decades, necessarily the most widely deployed packet networking

protocols, they were the ones that, at least during the 1990s, started

to break away from other protocols both in scale of deployment and

in the development of further technologies to support this

scaling.

2.2.3. End-to-End Transport

At the core of scalability, even up to now, is the lightweight

per-packet processing enabled through the end-to-end congestion and

loss management architecture as embodied through the IETF "Transmission

Control Protocol" (TCP, [RFC793] and successors, e.g.

[I-D.ietf-tcpm-rfc793bis]). This model eliminated more expensive

per-hop, per-packet processing, such as would be required for

reliable hop-by-hop forwarding through per-hop ARQ, which was key to

scaling routers cost effectively.

2.2.4. Global vs Restricted Connectivity: The Internet Routing

Architectures

The meshed peer-to-peer and transitive routing of the Internet

enabled through the IETF "Border Gateway (Routing) Protocol" (BGP,

[RFC4271] as well as predecessors) is another key factor to

successful scalability, because it enabled competitive market forces

to explore markets quickly.

Prior to the Internet, the public often only had access to highly

regulated international networking connections through often per-

country monopoly regulated data networks.

2.2.5. Freedom to Innovate

Non-IP networks often also did not allow as much

"freedom-to-innovate" (as it is often called in the IETF) for

applications running over them. Instead, those networks were

exploring the coupling

of packet transport with higher layer services to allow the network

operator some degree of revenue sharing with the services running on

top of it. Such approaches resulted not only in higher cost of those

services but also (likely) preferential and (often) exclusionary

treatment of network traffic not fitting the perceived highest

revenue service options.

2.2.6. End-to-End Encryption

When the same business practices were applied to IP networks, it was

one of the key factors leading to the development of IETF end-to-end

encryption through protocols such as "Transport Layer Security" (TLS,

[RFC2246] and its successors). This further strengthened the ability

to scale service/applications at minimum additional cost for the

underlying packet transport, arguably driving innovation into ever

faster networking technology and likely lower cost per bit.

2.2.7. Converged Networks

Another key factor in supporting scaling was IETF technologies that

allowed multiplexing of different types of traffic (e.g., realtime vs.

non-realtime), which previously used separate networks with typically

incompatible networking technologies.

Eliminating multiple physical networks with separate routing/

forwarding nodes and separate links affords significant energy

savings even at the same generation of speed and hence energy/bit

simply by avoiding the N-fold production and operations of equipment

and links. Of course, originally the CAPEX and OPEX of multiple,

technology-diverse networks and host-stacks was the core reason for

unified networks, and energy saving is in hindsight just incidental

(as for all other cases mentioned here).

2.2.7.1. IntServ and DetNet

The first (non-IETF) wider adopted technology promising converged

networks was "Asynchronous Transfer Mode" (ATM), which was designed

and deployed at the end of the 1980s to support specifically

multiplexing of "Data Voice and Video", where both Voice and Video

(at that time) required loss-free deterministic bounded latency and

low-jitter and had therefore their own Time-Division-Multiplex (TDM)

networks, both separate from so-called Data networks using packet

multiplexing. This technology was very expensive on a per-bit basis

due to its cell-forwarding nature though.

At the end of the 1980s, it was proven in [BOUNDED_LATENCY] that

variable length packet multiplexing in networks can also support

non-NP-hard calculations for bounded latency. This led the IETF

"Integrated Services WG" (INTSERV) to support such guaranteed

throughput and bounded latency traffic via [RFC2212] - and to the

demise of ATM.

IntServ has so far seen little traction for its original use-cases

(voice and video) because it too got superseded, as explained in the

following section. However, this type of service is being

revisited for a broader set of use-cases [RFC8575] in the DetNet WG,

which should enable even further network infrastructure convergence

for IoT and industrial markets.

2.2.7.2. DiffServ

Due to the much higher per-packet processing overhead of INTSERV

versus standard (so-called Best-Effort) Internet traffic, the INTSERV

model was already recognized in the 1990s to not support highest

scale at lowest cost, leading to the parallel development of the IETF

"Differentiated Services WG" (DIFFSERV) model defined in [RFC2475].

This has since then become the dominant technology to support

multiplexing of applications and services originally not designed for

the Internet onto a common TCP/IP network infrastructure,

specifically for voice and video over UDP ([RFC768]) including RTP

[RFC3550] and SIP.

2.2.7.3. SIP

SIP has most notably in the past two decades eliminated additional

network infrastructures previously required for (voice) telephony

services starting in the early 2000s with commercial/enterprise

deployments and today by removing even the option for any (non-IP/

SIP) analog or digital (ISDN) telephone service connection, instead

delivering those purely as services over adaptation interfaces on

home routers (TBD: Any RFC to cite for those tunneling/adaptation

services ?).

3. Higher or New Energy Consumption

Digitized, network centric workflows may consume more energy than

their non-digitized counterpart, as may new network centric workflows

without easy to compare prior workflows.

In one class of cases, the energy consumption on a per-instance

basis is lower than in the non-digitized/non-Internet-digitized case,

but the total number of instances that are (Internet)-digitized is

orders of magnitude larger than that of the alternative options,

typically because of their higher utility or lower overall cost.

For example, each instance of (simple text) email consumes less

energy than sending a letter or postcard. Even streaming a movie or

TV series consumes less energy than renting a DVD (see

https://www.smithsonianmag.com/science-nature/streaming-movie-less-

energy-dvd-180951586). Nevertheless, the total number of instances,

and as a result the energy consumption, of email and streaming easily

outranks that of their predecessor technologies.
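
The effect described above is simple multiplication: total energy is

the per-instance energy times the number of instances. The following

minimal sketch illustrates this. The function name and all numeric

values are hypothetical, chosen only for illustration; they are not

measurements taken from this memo or its references.

```python
def total_energy_wh(per_instance_wh: int, instances: int) -> int:
    """Total energy in watt-hours for a number of equal instances."""
    return per_instance_wh * instances


# Hypothetical figures, chosen only to illustrate the scaling effect.
legacy_total = total_energy_wh(per_instance_wh=50, instances=1_000)
digital_total = total_energy_wh(per_instance_wh=5, instances=100_000)

# The digitized workflow is 10x cheaper per instance, yet its total is
# 10x larger because it is used 100x more often.
print(legacy_total, digital_total, digital_total // legacy_total)
```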

While these instances look beneficial from a simple energy

consumption metric, their overall scale and the resulting energy

consumption may in itself become an issue, especially when the energy

demand it creates risks outstripping the possible energy production,

short term or long term. This concern is nowadays often raised

against the "digital economy", where the network energy consumption

is typically cited as a small contributor relative to its

applications, such as what is running in Data Centers (DC).

In other cases, digitization often requires significantly more energy

than the pre-digitization alternatives. The most well-known example

of this is likely crypto-currencies based on "proof-of-work"

computations (mining), which on a per currency value unit can cost 10

to 30 times or more of the energy consumed by, for example, gold

mining (very much depending on the highly fluctuating

price of the crypto-currency). Nevertheless, their overall utility

compared to such prior currencies or valuables makes them highly

successful in the market.

In general, the digital economy tends to be more energy intensive on

a per utility/value unit basis (for example by replacing a lot of

manual labor with computation), and/or it allows for faster growth of

its workflows.

The lower the cost of network traffic, and the more easily accessible

everywhere network connectivity is, the more competitive and/or

successful most of these new workflows of the digital economy can be.

Given how TCP/IP based networks, especially the Internet, have

excelled through their design principles (and success) in this

reduction of network traffic cost and ubiquitous access over the past

few decades, as outlined above, one can say that IETF technologies

and especially the Internet are the most important enabler of the

digital economy, and the energy consumption it produces.

4. Some Notes on Sustainability

Sustainability is the principle of utilizing resources in a way that

they do not diminish or run out over the long term. Beyond the energy

saving covered above, sustainability with respect to the IETF relates

specifically to the use of renewable sources of energy to

minimize exhaustion of fossil resources, and to the impact of IETF

technologies on global warming to avoid worsening living conditions

on the planet.

While there seems to be no IETF work specifically intending to target

sustainability (TBD: did we miss anything ?), the Internet itself can,

similarly to how it does for digitization, play a key role in building

sustainable networked IT infrastructures. The following subsections

list example areas where global high performance, low-cost

Internet networking is a key requirement.

4.1. Follow the Energy Cloud Scheduling

Renewable energy resources (except for water) commonly have

fluctuating energy output. For example, solar energy output

correlates to night/day and strength of sunlight. Cloud Data Centers

(DC) consume a significant amount of the IT sector's energy. Some

workloads may simply be scheduled to consume energy in accordance

with the amount of available renewable energy at the time, not

requiring the network. Significant workloads, however, are not

elastic in time, such as interactive cloud DC work (cloud based

applications) or entertainment (gaming, etc.). These workloads may

be instantiated at, or even dynamically (over time) migrate to, a DC

location with sufficient renewable energy, and the Internet (or large

TCP/IP OTT backbone networks) will serve as the fabric to access the

remote DC and to coordinate the instantiation/migration.
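
The placement decision described above can be sketched as follows.

This is a minimal illustration under stated assumptions, not a

description of any deployed scheduler: the Site fields, the

pick_site() function and all numbers are hypothetical, and a real

system would also have to account for latency, data locality and the

network energy spent on migration itself.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Site:
    """A candidate DC location (all fields are hypothetical telemetry)."""
    name: str
    renewable_kw: float  # renewable power currently available
    committed_kw: float  # power already committed to other workloads

    @property
    def headroom_kw(self) -> float:
        return self.renewable_kw - self.committed_kw


def pick_site(sites: Iterable[Site], demand_kw: float,
              current: Optional[str] = None) -> Optional[str]:
    """Return the site to run on, or None if no site has enough headroom."""
    candidates = [s for s in sites if s.headroom_kw >= demand_kw]
    if not candidates:
        return None
    # Stay at the current site if it still fits, to avoid a migration.
    if any(s.name == current for s in candidates):
        return current
    # Otherwise pick the site with the most renewable headroom.
    return max(candidates, key=lambda s: s.headroom_kw).name


sites = [Site("north", 500, 450), Site("south", 800, 300)]
print(pick_site(sites, demand_kw=100))                  # new placement
print(pick_site(sites, demand_kw=40, current="north"))  # no migration
```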

4.2. Minimize Generated Heat

The majority of energy in cloud DC is normally also wasted as exhaust

heat, requiring even more energy for cooling. The warmer the

location, the more energy needs to be spent for cooling. For this

reason, DC in cooler climates such as https://greenmountain.no/power-

and-cooling/ can help to reduce the overall DC energy consumption

significantly (independent of whether the energy consumed in the DC

is itself renewable). The Internet again plays the role of providing

access to those types of DC, whose location is not optimized for

consumption but for sustainable generation of compute and storage.

4.3. Heat Recovery

Exhaust heat, especially from compute in DC, can be recovered when it

is coupled to heating systems ranging in size from individual family

homes through larger buildings (hotels etc.) all the way to district

heating systems. A provider of this type of

compute-generated heat as a service can sell the compute capacity as

long as there is cost efficient network connectivity. "Cloud & Heat"

is an example company offering such infrastructures and services

https://www.cloudandheat.com/wp-content/

uploads/2020/02/2020_CloudHeat-Whitepaper-Cost-saving-Potential.pdf.

4.4. Telecollaboration

Telecollaboration has a long history in the IETF resulting in

multiple core technologies over the decades.

If one considers textual communications via email and netnews (using

e.g., NNTP) as early forms of telecollaboration, then

telecollaboration history through IETF technology reaches back into

the 1980s and earlier.

Around 1990, the IETF work on IP Multicast (e.g., [RFC1112] and later)

enabled the first efficient forms of audio/video group collaboration

through an overlay network over the Internet called the MBone

https://en.wikipedia.org/wiki/Mbone which was also used by the IETF

for more than a decade to provide remote collaboration for its own

(in-person + remote participation) meetings.

With the advent of SIP in the early 2000s, commercial

telecollaboration started to be built most often on SIP based session

and application protocols with multiple IETF working groups

contributing to that protocol suite (TBD: how much more example/

details should we have here). Using this technology and the

Internet, the immersive nature of telecollaboration was brought to

life-size video, which was/is called Telepresence

https://en.wikipedia.org/wiki/Telepresence and later to even more

immersive forms such as AR/VR telecollaboration.

In 2011, the IETF opened the "Real-Time Communication in WEB-

browsers" (RTCWEB) WG, whose work towards the end of that decade

became the most widely supported cross-platform standard for hundreds

of commercial and free telecollaboration solutions, including Cisco

Webex, which is also used by the IETF itself, Zoom and the new IETF

collaboration suite MeetEcho (TBD: good references here ?).

While the various forms of telecollaboration are mostly instances of

digitization, they are discussed under sustainability because their

comparison to in-person travel is nowadays based not on a simple

comparison of energy, but on a comparison of their impact on global

warming, a key factor to sustainability.

Telecollaboration was primarily developed because of the utility for

the participants - to avoid travel for originally predominantly

business communications/collaborations. It saw an extreme increase

in use (TBD: references) in the Corona Crisis of 2019, when

especially international travel was often prohibited, as often was even

working from an office. This forced millions of people to work from

home and utilize commercial telecollaboration tools. It equally

caused most in-person events that were not cancelled to be moved to

a telecollaboration platform over the Internet - most of them likely

relying on RTCWEB protocols.

Actual energy consumption related comparison between teleconferencing

and in-person travel is complex, but in recent decades it has

commonly been based on calculating some form of CO2 emission

equivalent of the energy consumed, hence comparing not simply the

energy consumption, but weighing it by the impact the energy

consumption has on one of the key factors (CO2 emission) known to

impact sustainable living conditions.

[VC2014] is a good example of a comparison between travel and

telecollaboration taking various factors into account and using CO2

emission equivalents as its core metric. That paper concludes that

the carbon/energy cost of telecollaboration could be as little as 7% of

that of an in-person meeting. Those numbers rest on various

assumptions and change when time-effort of participants is converted

to carbon/energy costs. These numbers should be even better today in

favor of telecollaboration: cost of Internet traffic/bit goes down

while cost of fossil fuel for travel goes up.

Recently, air travel has also come under more scrutiny because the

greenhouse gas emissions of air travel at the altitudes used by

commercial aviation have been calculated to have a higher global

warming impact than the same amount of CO2 emitted by the airplane

would have if it were exhausted at surface level. One publicly funded

organization offering carbon offset services calculates a factor of 3

relative to the CO2 emission of an airplane

https://www.atmosfair.de/de/fliegen_und_klima/flugverkehr_und_klima/

klimawirkung_flugverkehr/.
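
The figures quoted in this section can be combined into a rough,

purely illustrative calculation. The sketch below assumes the factor

of 3 for emissions at altitude and the best-case 7% ratio reported by

[VC2014]; combining the two is a simplification made here for

illustration and is not how [VC2014] derives its results. The

function names and the 500 kg input are hypothetical.

```python
def flight_warming_co2e_kg(direct_co2_kg: float,
                           altitude_factor: float = 3.0) -> float:
    """Scale a flight's direct CO2 by the altitude factor quoted above."""
    return direct_co2_kg * altitude_factor


def telecollab_co2e_kg(in_person_co2e_kg: float,
                       percent_of_in_person: float = 7.0) -> float:
    """Best-case telecollaboration cost as a share of the in-person cost."""
    return in_person_co2e_kg * percent_of_in_person / 100.0


direct_co2_kg = 500.0  # hypothetical direct CO2 of one traveller's flights
in_person = flight_warming_co2e_kg(direct_co2_kg)
remote = telecollab_co2e_kg(in_person)
print(in_person, remote)
```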

In summary: Telecollaboration has a higher sustainability benefit

compared to travel than just the comparison of energy consumption

because of the higher challenge to use renewable energy in

transportation than in networking, and this is most extreme in the

case of telecollaboration that replaces air travel because of the

even higher global warming impact of using fossil fuels in air

travel.

5. Energy Optimization in Specific Networks

5.1. Low Power and Lossy Networks (LLN)

Low Power and Lossy Networks are networks in which nodes and/or radio

links have constraints. Low power consumption constraints in nodes

often originate from the need to operate nodes for as long as

possible from battery and/or energy harvesting such as (today most

commonly) solar panels associated with the node or ambient energy

such as energy harvesting from movement for wearable nodes or piezo

cells to generate energy for mechanically operated nodes such as

switches.

Several IETF WGs have produced or are producing work primarily intended

to support LLN through multiple layers of the protocol stack. [RFC8352]

gives a good overview of the energy consumption related communication

challenges and solutions produced by the IETF for this space.

To minimize the energy needs for such nodes, their network data

processing mechanisms have to be optimized. This includes packet

header compression, fragmentation (to avoid latency through large

packets at low bitrates), packet bundling to only consume radio energy

in short time periods, radio energy tuning to just reach the

destination(s), minimization of multicasting to eliminate the need for

radio receivers to consume energy, and so on. [RFC8352] gives a more

detailed overview, especially across different L2 technologies such

as IEEE 802.15.4 type (low power) wireless networks, Bluetooth Low

Energy (BLE), WiFi (IEEE 802.11) and DECT ULE.

In the INTernet area of the IETF, several LLN specific WGs exist(ed):

5.1.1. 6LOWPAN WG

The "IPv6 over Low power WPAN (Wireless Personal Area Networks)"

(6lowpan) WG ran from 2005 to 2014 and produced 6 RFCs that adapt

IPv6 to IEEE 802.15.4 type (low power) wireless networks through

transmission procedures [RFC4944], compression of IPv6 (and

transport) packet headers [RFC6282], and modifications for neighbor

discovery (ND) [RFC6775], as well as 3 informational RFCs about the

WPAN space and applying IPv6 to it. The three specifications are

"Transmission of IPv6 Packets over IEEE 802.15.4 Networks" [RFC4944],

"Compression Format for IPv6 Datagrams over IEEE 802.15.4-Based

Networks" [RFC6282], and "Neighbor Discovery Optimization for IPv6

over Low-Power Wireless Personal Area Networks (6LoWPANs)" [RFC6775]

(6LOWPAN-ND).

5.1.2. LPWAN WG

Since 2014, the "IPv6 over Low Power Wide-Area Networks" (LPWAN) WG

has produced 4 RFCs for low-power wide area networks, such as LoRaWAN

https://en.wikipedia.org/wiki/LoRa, three of which are standards:

[RFC8724], [RFC8824], and [RFC9011].

5.1.3. 6TISCH WG

Since 2013, the "IPv6 over the TSCH mode of IEEE 802.15.4e" (6tisch)

WG has produced 7 RFCs for a version of 802.15.4 called the

"Time-Slotted Channel Hopping Mode" (TSCH), which supports

deterministic latency and, by scheduling traffic into well defined

time slots, lower energy consumption than 802.15.4 without TSCH.

5.1.4. 6LO WG

Since 2013, the "IPv6 over Networks of Resource-constrained Nodes"

(6lo) WG has generalized the work of 6lowpan for LLN in general,

producing 17 RFCs: IPv6-over-l2foo adaptation layer specifications,

information models, cross-adaptation layer specifications (such as

header specifications), and maintenance and informational documents

for other pre-existing IETF work in this space.

5.1.5. ROLL WG

In the RouTinG (RTG) area of the IETF, the "Routing Over Low power

and Lossy networks" (ROLL) WG has produced 23 RFCs since 2008.

Initially it produced requirement RFCs for different types of

"Low-power and Lossy Networks": urban [RFC5548], industrial

[RFC5673], home automation [RFC5826] and building automation

[RFC5867].

Since then, its work has mostly focused on the "IPv6 Routing Protocol

for Low-Power and Lossy Networks" (RPL) [RFC6550], which is used in a

wide variety of the above-described IPv6 instances of LLN and is

discussed in two ROLL applicability statement RFCs,

"Applicability Statement: The Use of the Routing Protocol for Low-

Power and Lossy Networks (RPL) Protocol Suite in Home Automation and

Building Control" [RFC7733] and "Applicability Statement for the

Routing Protocol for Low-Power and Lossy Networks (RPL) in Advanced

Metering Infrastructure (AMI) Networks" [RFC8036].

The ROLL WG also wrote a more generic RFC for LLN, "Terms Used in

Routing for Low-Power and Lossy Networks" [RFC7102]. RPL has a

highly configurable set of functions to support (energy) constrained

networks. Unconstrained root node(s), typically edge routers between

the RPL network and a backbone network, calculate

"Destination-Oriented Directed Acyclic Graphs" (DODAG) and can use

strict hop-by-hop source routing with dedicated IPv6 routing headers

[RFC9008] to minimize the routing related compute and memory

requirements of constrained nodes. "The Trickle Algorithm" [RFC6206]

minimizes routing related packets through automatic lazy updates.

While RPL is naturally a mesh network routing protocol, in which all

nodes are usually expected to participate, RPL also supports even

more lightweight leaf nodes [RFC9010].
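
The "lazy update" behavior of Trickle can be illustrated with a

minimal, event-driven sketch of the timer rules in [RFC6206]. The

class and method names are illustrative assumptions, not part of any

RPL implementation, and the caller is assumed to drive the clock and

invoke the methods when the corresponding events occur.

```python
import random


class TrickleTimer:
    """Sketch of the Trickle timer rules (Imin, Imax, k) from RFC 6206."""

    def __init__(self, imin, doublings, k, rng=None):
        self.imin = imin                     # minimum interval size
        self.imax = imin * (2 ** doublings)  # maximum interval size
        self.k = k                           # redundancy constant
        self.rng = rng or random.Random()
        self.interval = imin                 # start at the smallest interval
        self._start_interval()

    def _start_interval(self):
        # New interval: reset the counter and pick a transmission time t
        # in the second half of the interval. RFC 6206 uses [I/2, I);
        # uniform() may in rare cases return the upper bound itself.
        self.counter = 0
        self.t = self.rng.uniform(self.interval / 2, self.interval)

    def hear_consistent(self):
        # A neighbor sent the same information: count it.
        self.counter += 1

    def hear_inconsistent(self):
        # New information appeared: fall back to fast updates.
        if self.interval > self.imin:
            self.interval = self.imin
            self._start_interval()

    def should_transmit(self):
        # To be evaluated at time t: stay silent if enough neighbors
        # have already sent consistent information.
        return self.counter < self.k

    def interval_expired(self):
        # Quiet network: double the interval up to the maximum.
        self.interval = min(self.interval * 2, self.imax)
        self._start_interval()
```

In a stable network the interval keeps doubling until it reaches its

maximum, so routing related transmissions (and the radio energy they

cost) become rare, while an inconsistency resets the timer to fast

updates.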

The 2013 [I-D.ajunior-energy-awareness-00] proposes introducing

energy related parameters into RPL to support calculation/selection

of the most energy efficient paths. The 2017 "An energy

optimization routing scheme for LLSs",

[I-D.wang-roll-energy-optimization-scheme] observed that DODAGs in

RPL tend to require more energy in nodes closer to the root and

proposed specific optimizations to reduce this problem. Neither of

these drafts proceeded in the IETF.

While the original use-cases for RPL were energy and size limited

networks, its design is to a large extent not scale limited. Because

of this, and due to its reduced compute/memory requirements for

networks of the same size compared to other routing protocols,

especially the so-called link-state "Interior Gateway routing

Protocols" (IGP), such as the most commonly used protocols ISIS

[RFC1142] and OSPF [RFC2328], RPL has also proliferated into

use-cases for non-constrained networks, for example to support the

largest possible networks automatically, such as in [RFC8994].

5.2. Constrained Nodes and Networks

(Power) constrained nodes and/or networks exist in a much broader

variety than just in combination with low-power and lossy networks.

For example, WiFi and mobile network connections are not considered

to be lossy networks, and personal mobile nodes with either type of

connection are orders of magnitude less constrained than nodes

typically attached to an LLN. Therefore, work in the IETF that is

broader than a primary focus on LLN typically uses just the term

lightweight or constrained (nodes and networks).

5.2.1. LWIG WG

Since 2013, the "Light-Weight Implementation Guidance" (lwig) WG has

produced 6 informational RFCs on the group's subject, much of which

indirectly supports power efficient network implementations via

lightweight nodes/links. It also addressed the topic explicitly,

including via the aforementioned [RFC8352] and [RFC9178], "Building

Power-Efficient Constrained Application Protocol (CoAP) Devices for

Cellular Networks".

5.2.2. CoRE and CoAP

In the APPlication (APP) area of the IETF, the "Constrained RESTful

Environments" (core) WG has produced 21 RFCs since 2010, most of them

for or related to "The Constrained Application Protocol" (CoAP)

[RFC7252], which can best be described as a replacement for HTTP in

constrained environments. It uses UDP instead of TCP, DTLS instead

of TLS, compact binary message formats instead of human readable

textual formats, and RESTful message exchange semantics instead of a

broader set of options (as in HTTP). It also adds functionality such

as (multicast) discovery and directory services, thereby providing a

more comprehensive set of common application functions with a more

compact on-the-wire/radio encoding than its unconstrained

alternatives.
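
The compactness of the binary encoding can be illustrated with a

minimal sketch that builds a confirmable CoAP GET request with a

single Uri-Path option and compares its size with a comparable

textual HTTP/1.1 request. The function name, the resource path

"temp" and the host name "sensor.example" are illustrative

assumptions; the sketch only handles one short path segment and no

token.

```python
import struct


def coap_get(path_segment: str, message_id: int) -> bytes:
    """Encode a minimal CoAP GET (version 1, Confirmable, no token)."""
    version, msg_type, token_len = 1, 0, 0
    code = 0x01  # 0.01 = GET
    first_byte = (version << 6) | (msg_type << 4) | token_len
    header = struct.pack("!BBH", first_byte, code, message_id)

    seg = path_segment.encode("utf-8")
    if len(seg) > 12:
        raise ValueError("sketch only handles option lengths up to 12")
    # Uri-Path is option number 11; as the first option its delta is 11.
    option = bytes([(11 << 4) | len(seg)]) + seg
    return header + option


coap_msg = coap_get("temp", 0x1234)
http_msg = b"GET /temp HTTP/1.1\r\nHost: sensor.example\r\n\r\n"
print(len(coap_msg), "bytes (CoAP) vs", len(http_msg), "bytes (HTTP)")
```

Fewer bytes per request translate into less airtime and therefore

less radio energy on constrained links.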

"Object Security for Constrained RESTful Environments" (OSCORE),

[RFC8613] is a further product of the CoRE WG providing a more

message layer based, more lightweight security alternative to DTLS.

While originally designed for LLN, CoAP is transcending LLN and

equally becoming a standard in unconstrained environments such as

wired/ethernet industrial Machine-to-Machine (M2M) communications,

because of its simplicity and flexibility and because it relies on a

single set of protocols supporting the widest range of deployment

scenarios.

In the SECurity (SEC) area of the IETF, the "Authentication and

Authorization for Constrained Environments" (ace) working group has

produced 4 RFCs since 2014 for security functions in constrained

environments, for example CoAP based variations of prior HTTPS

protocols such as EST-coaps [RFC9148] for HTTPS based EST [RFC7030].

Constrained node support in cryptography especially entails support

for Elliptic Curve (EC) public keys due to their shorter key sizes

and lower compute requirements compared to RSA public keys of the

same cryptographic strength. While the benefits of EC over RSA were

already making them preferred, this "additional market space"

(constrained node) benefit helped in their faster market

proliferation even beyond constrained networks.

5.2.3. Satellite Constellations

Emerging communication infrastructures may have specific requirements

on power consumption. Such requirements should be taken into account

when designing/customizing techniques (e.g., routing) to be enabled

in such networks. For example,

[I-D.lhan-problems-requirements-satellite-net] identifies a set of

requirements (including power) for satellite constellations.

5.2.4. Devices with Batteries

Many IETF protocols (e.g., [RFC3948]) were designed to accommodate

the presence of middleboxes mainly by encouraging clients to issue

frequent keepalives. Such a strategy has implications for

battery-powered devices. In order to optimize battery consumption

for such devices, [RFC6887] specifies a deterministic method by which

clients can control state in the network, including its lifetime.

Keepalive messages may thus be optimized as a function of the network

policies.
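
The effect on a battery-powered device can be estimated with simple

arithmetic. The following sketch compares the number of radio

wake-ups per day for a heuristic keepalive interval with the number

needed when the lifetime of the network state is known. All numeric

values, and the choice to refresh halfway through the lifetime, are

illustrative assumptions and not taken from any specification.

```python
SECONDS_PER_DAY = 86400


def wakeups_per_day(interval_s: float) -> float:
    """Number of keepalive transmissions per day for a given interval."""
    return SECONDS_PER_DAY / interval_s


# Assumed values for illustration only.
heuristic_interval_s = 30     # guess made without knowledge of the network
granted_lifetime_s = 7200     # state lifetime communicated by the network
refresh_interval_s = granted_lifetime_s / 2  # refresh halfway through

heuristic = wakeups_per_day(heuristic_interval_s)
explicit = wakeups_per_day(refresh_interval_s)
print(f"heuristic keepalives: {heuristic:.0f} wake-ups/day")
print(f"explicit lifetime:    {explicit:.0f} wake-ups/day")
print(f"reduction factor:     {heuristic / explicit:.0f}x")
```

With these assumed values the device wakes up 2880 times per day in

the heuristic case and 24 times per day when it knows the lifetime.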

A_REC#2 of [RFC7849] further insists on the importance of saving

battery, whose drain is exacerbated by keepalive messages, and

recommends the support of collaborative means to control state in the

network rather than relying on heuristics.

5.3. Sample Technical Enablers

5.3.1. (IP) Multicast

5.3.1.1. Power Saving with Multicast

IP Multicast, introduced with [RFC1112] and today also called "Any

Source Multicast" (ASM), has various protocols standardized in the

IETF across multiple working groups. There are also MPLS and BIER

multicast protocols from the IETF, developed in the equally named

WGs. These three network layer multicast technologies can save power

when used to distribute data because they reduce the number of

packets that need to be sent across the network (through in-network

replication where needed). Because most current link and router

technologies do not actually save significant amounts of energy at

lower than maximum utilization, these benefits are often only

theoretical though. Software routers are the ones most likely to

exhibit energy consumption somewhat proportional to their throughput,

and only for the forwarding (CPU) chip.
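
The reduction in packets can be made concrete with a small sketch

that counts, for one data packet, the link transmissions needed with

replicated unicast versus in-network replication on a distribution

tree. The tree, the node names and the function name are

illustrative assumptions.

```python
def link_transmissions(parent, receivers):
    """Return (unicast, multicast) link transmissions for one packet.

    parent maps each node to its upstream node (None for the source).
    """
    def path_to_source(node):
        links = []
        while parent[node] is not None:
            links.append((parent[node], node))
            node = parent[node]
        return links

    paths = [path_to_source(r) for r in receivers]
    # Replicated unicast: every receiver gets its own copy on every hop.
    unicast = sum(len(p) for p in paths)
    # Multicast: each link of the tree carries the packet only once.
    multicast = len({link for p in paths for link in p})
    return unicast, multicast


# Hypothetical topology: src -> core -> {edge1 -> r1, r2; edge2 -> r3}
parent = {
    "src": None, "core": "src",
    "edge1": "core", "edge2": "core",
    "r1": "edge1", "r2": "edge1", "r3": "edge2",
}
print(link_transmissions(parent, ["r1", "r2", "r3"]))
```

For this topology the sketch yields 9 link transmissions with

replicated unicast and 6 with multicast.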

Likewise, in large backbone networks, IP multicast can free up

bandwidth to be used for other traffic, such as unicast traffic,

which may avoid upgrades to faster and potentially more power

consuming routers/links. Today, these benefits too are most often

outweighed by the lower per-bit energy consumption of newer

generations of routers and links though.

Multicasting can also save energy on the transmitting station across

radio links, compared to replicated unicast traffic, but this is

rarely significant because, except in fully battery powered mesh

networks, the transmitting stations are typically

non-energy-constrained nodes, such as (commonly) the wired access

points in WiFi networks.

As a result, multicasting today typically has no significant power

saving benefits with available network technologies. Instead, it is

used (for data distribution) when a unicast alternative (with

so-called ingress replication) is not possible because of the total

amount of traffic it would generate. This includes wireless/radio

networks, where airtime is equally the limiting factor.

5.3.1.2. Power Waste Through Multicast-based Service Coordination

(IP) multicast is often used not only to distribute data requested by

receivers, but also for coordination type functions such as service

or resource announcement, discovery or selection. These multicast

messages may not carry a lot of data, but they cause recurring, often

periodic packets to be sent across a domain and waste energy because

of various ill-advised designs, including, but not limited to, the

following issues:

(a) The receivers of such packets may not even need to receive them,

but the protocol shares a multicast group with another protocol that

the client does need to receive.

(b) The receiver should not need to receive the packet as far as

multicast is concerned, but the underlying link-layer technology

still makes the receiver consume the packet at link-layer.

(c) The information received is not new, but just periodically

refreshed.

(d) The packet was originated for a service selection by a client,

and the receiving device is even responding, but the client then

chooses to select another device for the service/resource.

These problems are especially problematic in the presence of

so-called "sleepy" nodes (see Section 5.3.2) that need to wake up

(unnecessarily) to receive such packets. It is worse when the

network itself is an LLN in which the forwarders themselves are power

constrained and, for example, periodic multicasting of such

coordination packets wastes energy on those forwarders as well,

compared to better alternatives.

In 2006, the IETF standardized "Source Specific Multicast" (SSM)

[RFC4607], a variation of IP Multicast that does not allow these

types of coordination functions but is only meant for (and usable

for) actual data distribution. SSM was introduced for reasons other

than the above-described power related issues, but deprecating the

use of ASM is one way to avoid/minimize its ill-advised use for these

types of coordination functions when energy efficiency is an issue.

[RFC8815] is an example of deprecating ASM for other reasons in

Service Provider networks.

5.3.1.3. Multicast Problems in Wireless Networks

[RFC9119] covers multicast challenges and solutions (proposals) for

IP Multicast over Wi-Fi. With respect to power consumption, it

discusses the following aspects:

(a) Unnecessary wake-up of power constrained Wi-Fi Stations (STA)

nodes can be minimized by wireless Access Points (APs) that buffer

multicast packets so they are sent only periodically when those nodes

wake up.

(b) WiFi access points with "Multiple Input Multiple Output" (MIMO)

antenna diversity focus sent packets in a way that they are not

"broadcast" to all receivers within a particular maximum distance

from the AP, making WiFi multicast transmission even less desirable.

(c) It lists the most widely deployed protocols using aforementioned

coordination via IP multicast and describes their specific challenges

and possible improvements.

(d) Existing proprietary conversion of WiFi multicast to Wi-Fi

unicast packets.

[I-D.desmouceaux-ipv6-mcast-wifi-power-usage] focuses on IPv6-related

concerns of multicast traffic in large wireless networks. This

document provides a set of statistics on such flows and the device

power consumption they induce.

5.3.2. Sleepy Nodes

Sleepy nodes are one of the most common design solutions in support

of power saving. This includes LLN level constrained nodes, but also

nodes with significant battery capacity, such as mobile phones,

tablets and notebooks, because battery lifetime has long been a key

selling factor. As a result, vendors attempt to optimize power

consumption across all hardware and software components of such

nodes, including the interface hardware and protocols used across the

nodes' WiFi and mobile radios.

Restating from [I-D.bormann-core-roadmap-05]: CoAP has basic support

for sleepy nodes by allowing caching of resource information in (non-

sleepy) proxy nodes. [RFC7641] enhances this support by enabling

sleepy nodes to update caching intermediaries on their own schedule.

Around 2012/2013, there was significant review of further support

for sleepy nodes in CoAP, resulting in a long list of drafts, whose

sleepy node benefits are discussed in

[I-D.bormann-core-roadmap-05]: [I-D.vial-core-mirror-server],

[I-D.vial-core-mirror-proxy], [I-D.fossati-core-publish-option],

[I-D.giacomin-core-sleepy-option], [I-D.castellani-core-alive],

[I-D.rahman-core-sleepy-problem-statement], [I-D.rahman-core-sleepy],

[I-D.rahman-core-sleepy-nodes-do-we-need],

[I-D.fossati-core-monitor-option]. None of these drafts proceeded

though.

One partial solution to some sleepy node issues related to their

energy consumption, especially the ones caused by the use of

multicast (see Section 5.3.1.2 and Section 5.3.1.3), is the use of

the "Constrained RESTful Environments (CoRE) Resource Directory"

(CoRE-RD) [RFC9176]. It allows sleepy nodes to discover and register

resources via unicast and avoids waking up sleepy nodes when they are

not selected by a resource consumer.

A partial alternative to CoRE-RD is "DNS-Based Service Discovery"

(DNS-SD) [RFC6763] combined with, for example, "Service Registration

Protocol for DNS-Based Service Discovery" [I-D.ietf-dnssd-srp].

Services can be seen as a subset of resources,

and in networks where DNS has to be supported anyhow for other

reasons, DNS-SD may be a sufficient alternative to CoRE-RD. It is

used for example in Thread https://en.wikipedia.org/wiki/

Thread_(network_protocol) for this purpose and the only multicast

based coordination is the one to establish network wide parameters,

such as the address(es) of DNS-SD server(s).

"Building Power-Efficient Constrained Application Protocol (CoAP)

Devices for Cellular Networks" [RFC9178] discusses sleepy devices,

especially the use of CoAP PubSub [I-D.ietf-core-coap-pubsub] as a

mechanism to build proxies for sleepy devices. Standardized proxy

infrastructures are best built with standard data models, such as

"Sensor Measurement Lists" (SenML) [RFC8428] for sensors, which are

likely the largest group of sleepy devices, especially in LLN.
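
As an illustration of such a standard data model, the following

sketch builds a small SenML pack in its JSON representation using the

"bn", "bt", "n", "u", "v" and "t" fields of [RFC8428]. The device

name, the measurement values and the function name are hypothetical.

```python
import json


def senml_pack(base_name, base_time, readings):
    """Build a SenML JSON pack; readings are (name, unit, value, offset)."""
    records = []
    for index, (name, unit, value, time_offset) in enumerate(readings):
        record = {"n": name, "u": unit, "v": value, "t": time_offset}
        if index == 0:
            # Base fields in the first record apply to all later records.
            record = {"bn": base_name, "bt": base_time, **record}
        records.append(record)
    # Compact separators keep the encoded pack small.
    return json.dumps(records, separators=(",", ":"))


pack = senml_pack(
    "urn:dev:example:sensor1:",   # hypothetical device name
    1600000000,                   # base time, seconds since the epoch
    [("temp", "Cel", 21.5, 0), ("humidity", "%RH", 40.0, 0)],
)
print(pack)
```

A proxy that caches such a pack can answer requests on behalf of the

sensor while the sensor itself sleeps.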

"Reducing Energy Consumption of Router Advertisements", [RFC7772]

eliminates/reduces the energy impact for sleepy nodes of the

ubiquitous IPv6 "Neighbor Discovery" (ND) protocol by recommending

that multicast "Router Advertisement" (RA) messages be replaced with

so-called directed unicast versions, which avoids waking up sleepy

nodes (with an IP multicast RA message). This was already allowed in

ND [RFC4861], but not recommended as the default.

Note that [RFC7772] does not provide all the energy related

optimizations of ND as developed by 6LoWPAN through [RFC6775].

[I-D.chakrabarti-nordmark-energy-aware-nd] proposes generalizing

those optimizations to all IPv6 links, but has not been pursued

further by the IETF so far.

5.4. (Lack of) Power Benchmarking Proposals

[I-D.petrescu-v6ops-ipv6-power-ipv4] presented some measurement

results of the power consumption when using IPv6 vs IPv4 with a focus

on mobile devices. Such measurements are not backed by formal

benchmarking methodologies that would establish solid and reliable

references for comparing and interpreting data.

https://www.ietf.org/proceedings/103/slides/slides-103-saag-iot-

benchmarking-00 presented a benchmark example but with a focus on

power cost of encryption.

6. Energy Management Networks

The use of IETF protocols in networks that manage power consumption

and production is another broad area of digitization.

6.1. Smart Grid

"Smart Grid" is the most well-known instance of such energy

management networks. According to https://en.wikipedia.org/wiki/

Smart_grid, the term covers aspects mostly centered around

intelligently measured and controlled consumption of energy. This

includes "Advanced Metering Infrastructure" / "Smart Meters", remote

controllable "distribution boards", "circuit breakers", "load

control" and "smart appliances". Use cases for the "Smart Grid"

include for example timed and measured operations of home devices

such as washers or charging cars, when energy consumption is below

average.

The 2011 "Internet Protocols for the Smart Grid" [RFC6272] is a quite

comprehensive (66 page) overview of all IETF protocols considered to

be necessary or beneficial for Smart Grid networks. This document

was written in response to interest by the (not-yet-smart grid)

community in utilizing the IETF TCP/IP technologies to evolve

previously non-TCP/IP network, and the risk that unnecessary

reinvention of the wheel/protocols would be done by that community

instead of reusing what was already well specified by the IETF.

Most of the overview in this document is not specific to networks

used for Smart Grid applications but is summarized in the document

for the above-described outreach and education to the community. The

aspect most specific to Smart Grids is the adaptation of IPv6 network

technologies to LLN (see Section 5.1), which was still somewhat in

its infancy in 2011: smart meters, circuit breakers, load measurement

devices, car chargers and so on would most likely be connected to the

network via low-power radio networks, which ideally would utilize

IPv6 directly. Support for LLN with IPv6 has improved considerably

in IETF specifications in the past decade.

6.2. Synchrophasor Networks

Power output of multiple power plants/generators into the same power

grid needs to be synchronized in power level (based on consumption)

and in power phase (50/60Hz depending on continent). Energy

generated out-of-phase is not only wasted, but can actually burn out

power lines or permanently damage power generators. When generators

go out-of-sync, they have to be emergency switched off, resulting in

(rolling) blackouts and worsening conditions beyond the likely root

cause, such as a single overloaded, limited region.

Synchrophasor networks are networks whose goal is to support

synchronization of power generators across a power grid, ultimately

also permitting larger and more resilient power grids to be built.

"Phasor Measurement Units" (PMU) are their core sensing elements.

Since about 2012, these networks have started to move from

traditional SCADA towards more TCP/IP based networking and

application technologies "to improve power system reliability and

visibility through wide area measurement and control, by fostering

the use and capabilities of synchrophasor technology"

(www.naspi.org).

With their fast control loop reaction time and measurement

requirements, they also benefit from reliable, fast propagation of

PMU data as well as stricter clock synchronization than most Smart

Grid applications. For example, transmission lines expand under heat

that is caused by electrical load and/or environmental temperature by

as much as 30% (between coldest and hottest or highest-load times),

impacting the necessary phase relationship of power generation on

either end (propagation at the speed of light over the effective

length of the contracted/expanded wire).

The length of transmission wires can be measured from data sent

across the transmission lines and measuring their propagation latency

with the help of accurate clock synchronization between sender and

receiver(s), using for example network-based clock synchronization

protocols. The IETF "Network Time Protocol version 4" (NTPv4),

[RFC5905] is one option for this. The IEEE PTP protocol is often

preferred though because it specifies better how measurements can be

integrated at the hardware level of Ethernet interfaces, thus making

it easier to achieve higher accuracy, such as a Maximum Time Interval

Error (MTIE) of less than 1 msec. See for example [NASPICLOCK].
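
The relationship between clock accuracy and length measurement can be

sketched as follows. The sketch assumes that the signal propagates

at a known fraction (velocity factor) of the speed of light and that

sender and receiver timestamps come from synchronized clocks; the

function names and the default velocity factor of 1.0 are

illustrative assumptions.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def line_length_m(t_sent_s, t_received_s, velocity_factor=1.0):
    """Estimate the effective line length from a one-way delay."""
    delay_s = t_received_s - t_sent_s
    if delay_s < 0:
        raise ValueError("receive time precedes send time")
    return delay_s * SPEED_OF_LIGHT_M_PER_S * velocity_factor


def length_error_m(clock_error_s, velocity_factor=1.0):
    """Length uncertainty caused by a given clock offset."""
    return abs(clock_error_s) * SPEED_OF_LIGHT_M_PER_S * velocity_factor


# A one-way delay of 1 ms corresponds to roughly 300 km of line.
print(f"{line_length_m(0.0, 0.001) / 1000:.1f} km")
# A clock offset of 1 microsecond already corresponds to about 300 m.
print(f"{length_error_m(1e-6):.1f} m")
```

This is why the achievable clock accuracy directly bounds how

precisely the effective length of a transmission line can be tracked.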

The "North American SynchroPhasor Initiative" (NASPI),

https://www.naspi.org is an example organization in support of

synchrophasor networking. It is an ongoing project by the USA

"Department of Energy" (DoE).

7. (Limited) Energy Management for Networks

7.1. Some Metrics

A 2010-2013 draft [I-D.manral-bmwg-power-usage], which was not

adopted, discussed and proposed metrics for power consumption that

were intended to be used for benchmarking.

The later work in Section 7.2 referred instead to other metrics for

measuring power consumption from other SDOs.

A 2011-2012 draft [I-D.jennings-energy-pricing], which was not

adopted, discusses and proposes a data model to communicate the

time-varying cost of energy, so that the power consumption of network

attached or managed equipment can be shifted in time.
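
Time-shifting based on a communicated price schedule can be sketched

as picking the cheapest contiguous window for a deferrable task. The

hourly price list and the function name are illustrative assumptions

and do not reflect the data model of the draft.

```python
def cheapest_start_hour(prices, duration_h):
    """Return the start index of the cheapest contiguous window."""
    if not 1 <= duration_h <= len(prices):
        raise ValueError("duration must fit into the price schedule")
    window_costs = [
        sum(prices[start:start + duration_h])
        for start in range(len(prices) - duration_h + 1)
    ]
    # min() returns the earliest start hour among equally cheap windows.
    return min(range(len(window_costs)), key=window_costs.__getitem__)


# Hypothetical price per kWh for six consecutive hours.
hourly_prices = [30, 28, 12, 10, 11, 25]
print(cheapest_start_hour(hourly_prices, 2))
```

For the assumed schedule, a two-hour task is cheapest when started at

hour index 3.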

7.2. EMAN WG

While the IETF did specify a few MIBs with aspects related to power

management, it was only with the formation of the "Energy

Management" (EMAN) WG, which ran from 2010 to 2015 and released 7

RFCs, that the IETF produced a comprehensive set of MIB based

standards for managing energy/power for network equipment and

associated devices and integrated prior scattered power management

related work. EMAN produced (solely) a set of data/information

models (MIBs). It does not introduce any new protocol/stacks nor

does it address "questions regarding Smart Grid, electricity

producers, and distributors" (from [RFC7603]).

[I-D.claise-power-management-arch] describes the initial EMAN

architecture as envisioned by some of the core contributors to the

WG. It was rewritten in EMAN as the "Energy Management Framework"

[RFC7326]. "Requirements for Energy Management" are defined in

[RFC6988].

According to [RFC7326], "the (EMAN) framework presents a physical

reference model and information model. The information model

consists of an Energy Management Domain as a set of Energy Objects.

Each Energy Object can be attributed with identity, classification,

and context. Energy Objects can be monitored and controlled with

respect to power, Power State, energy, demand, Power Attributes, and

battery. Additionally, the framework models relationships and

capabilities between Energy Objects."
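
The structure described in this quote can be pictured with a minimal

sketch of a domain holding Energy Objects. The class names,

attribute names and state labels are illustrative assumptions; they

are not the object names defined in the EMAN MIBs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnergyObject:
    uuid: str          # identity
    category: str      # classification, e.g. "consumer" or "producer"
    context: str       # context, e.g. a location or role label
    power_w: float     # currently measured power in watts
    power_state: str   # illustrative labels such as "on" or "sleep"


@dataclass
class EnergyManagementDomain:
    name: str
    objects: List[EnergyObject] = field(default_factory=list)

    def total_power_w(self) -> float:
        """Sum of the measured power of all objects in the domain."""
        return sum(obj.power_w for obj in self.objects)

    def in_state(self, power_state: str) -> List[EnergyObject]:
        """All objects currently in the given power state."""
        return [o for o in self.objects if o.power_state == power_state]


domain = EnergyManagementDomain("building-1")
domain.objects.append(EnergyObject("id-1", "consumer", "floor 1", 12.5, "on"))
domain.objects.append(EnergyObject("id-2", "consumer", "floor 2", 0.5, "sleep"))
print(domain.total_power_w())
```

Monitoring then amounts to reading such attributes per object, and

control to changing an object's power state.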

One category of use-cases of particular interest to network equipment

vendors was and is the management of "Power over Ethernet" via the

EMAN framework, measuring and controlling ethernet connected devices

through their PoE supplied power. Besides industrial, surveillance

cameras and office equipment, such as WiFi access points and phones,

PoE is also positioned as a new approach for replacing most in-

building automation components including security control for doors/

windows, as well as environmental controls and lighting through the

use of an in-ceiling, PoE enabled IP/ethernet infrastructure.

EMAN produced version 4 of the "Entity MIB" (ENTITY-MIB) [RFC6933],

primarily to introduce globally unique UUIDs for physical entities

that make it easier to link across different entities, such as a PoE

port on an ethernet switch and the device connected to that switch.

The "Monitoring and Control MIB for Power and Energy" [RFC7460]

specifies a MIB for monitoring the Power State and energy consumption

of networked devices. The document discusses the link with other

MIBs such as the ENTITY-MIB, the ENTITY-SENSOR-MIB [RFC3433], for

which it adds missing accuracy information to meet IEC power

monitoring requirements, the "Power Ethernet MIB"

(POWER-ETHERNET-MIB) [RFC3621] to manage PoE, and the pre-existing

IETF MIB for Uninterruptible Power Supplies (UPS) (UPS-MIB)

[RFC1628]. This allows, for example, building control systems that

manage shutdowns of devices in case of power failure based on UPS

battery capacity and device consumptions/priorities. Similarly, the

EMAN "Definition of Managed Objects for Battery Monitoring" [RFC7577]

defines objects to support battery monitoring in managed devices.
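
A control system of the kind mentioned above could, under strongly

simplified assumptions, decide which devices to shut down as follows.

The sketch assumes a constant load per device, a known remaining

battery capacity, and a numeric priority per device; all names and

values are hypothetical and not taken from any of the MIBs.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    power_w: float   # assumed constant power draw in watts
    priority: int    # higher value means more important


def devices_to_shut_down(devices, battery_wh, required_runtime_h):
    """Shed the least important devices until the load fits the budget."""
    budget_w = battery_wh / required_runtime_h
    load_w = sum(d.power_w for d in devices)
    shed = []
    for device in sorted(devices, key=lambda d: d.priority):
        if load_w <= budget_w:
            break
        shed.append(device.name)
        load_w -= device.power_w
    return shed


devices = [
    Device("core-switch", 150.0, 3),
    Device("wifi-ap", 20.0, 2),
    Device("signage", 80.0, 1),
]
# 400 Wh must last 2 hours, so at most 200 W may be drawn.
print(devices_to_shut_down(devices, battery_wh=400.0, required_runtime_h=2.0))
```

With these assumed values only the lowest-priority device needs to be

shut down to stay within the 200 W budget.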

The pre-existing IETF "Entity State MIB" (ENTITY-STATE-MIB) [RFC4268]

allows specifying the operational state of entities specified via the

ENTITY-MIB with respect to their power consumption and operational

capabilities (e.g., "coldStandby", "hotStandby", "ready" etc.).

Devices can also act as proxies to provide MIB interfaces for

monitoring and control of power for other devices that may use other

protocols, such as in the case of a home gateway interfacing with

various vendor specific protocols of home equipment.

The EMAN "Energy Object Context MIB" [RFC7461] defines the ENERGY-

OBJECT-CONTEXT-MIB and IANA-ENERGY-RELATION-MIB, both of which serve

to "address device identification, context information, and the

energy relationships between devices" according to [RFC7461].

To automatically discover and negotiate PoE power consumption between

switch and client, non-IETF technologies, such as IEEE "Link Layer

Discovery Protocol" (LLDP) and proprietary MIBs for it, such as LLDP-

EXT-MED-MIB can be used.

Finally, the "Energy Management (EMAN) Applicability Statement"

[RFC7603] provides an overview of EMAN with a user/operator

perspective, also reviewing a range of typical scenarios it can

support as well as how it could/can link to a variety of pre-

existing, non-IETF standards relevant for power management. Such

intended applicability includes home, core, and DC networks.

There are currently no equivalent YANG modules. Such modules would

not only be designed to echo the EMAN MIBs but would also allow

controlling dedicated power optimization engines instead of relying

upon static and frozen vendor-specific optimization.

8. Power-awareness in Forwarding and Routing Protocols

8.1. Power Aware Networks (PANET)

In 2013/2014, some drafts proposed how networks themselves,

specifically those of Internet Service Providers (ISP), could become

"power aware" to the extent that their power consumption could be

regulated (or self-regulate) based on the currently required

performance of the network and/or available power. This would be

done by reducing excess (or too power expensive) network capacity

through switching off or low-powering components such as redundant

routers, linecards, interfaces or links, or by reducing bitrates on

links.

The 2013 "Power-Aware Networks (PANET): Problem Statement"

[I-D.zhang-panet-problem-statement] gives an overview of this

concept, and so does "Power-aware Routing and Traffic Engineering:

Requirements, Approaches, and Issues", [I-D.zhang-greennet] from the

same year.

The 2014 [I-D.retana-rtgwg-eacp] exemplifies the concept and

discusses key challenges such as the reduced resilience against

errors when redundant components are switched off, the risk of

increased stretch (path length) and therefore latency under partial

network component shutdown or downspeeding, as well as the idea of

saving energy through (periodic) microsleeps such as possible with

"Energy Efficient Ethernet" https://en.wikipedia.org/wiki/Energy-

Efficient_Ethernet links. The 2013 draft "Reducing Power Consumption

using BGP with power source data",

[I-D.mjsraman-panet-inter-as-power-source] proposed BGP attributes to

allow calculation of power efficient (or for example green) paths.

One core market driver for this work was the rolling blackouts that

especially affected India at the time of these drafts, raising the

desire to, for example, reduce the total power consumption of a

network in times of such energy emergencies.

While there was technical interest in the IETF, the market

significance for the vendors mostly present in the IETF was

considered not important enough. Likewise, traditional routers,

unlike for example today's standard PC hardware designs, exhibit

little power savings upon shutdown of components such as line-cards

or interfaces.

In addition, SDN / controller-based solutions were still relatively

in their infancy back in 2013/2014, and technologies that would allow

an SDN controller to have resilient (self-healing) connectivity, such

as described in [RFC8368]/[RFC8994], were also not available. This

made the risk of severely impacting network reliability one of the

key factors for this PANET work not proceeding so far.

8.2. SDN-based Semantic Forwarding

Recently, [I-D.boucadair-irtf-sdn-and-semantic-routing] provided the

following feature as an example of capabilities that can be offered

by appropriate control of forwarding elements:

Energy-efficient Forwarding: An important effort was made in the past

to optimize the energy consumption of network elements. However,

such optimization is node-specific and no standard means to optimize

the energy consumption at the scale of the network have been defined.

For example, many nodes (also, service cards) are deployed as

backups.

A controller-based approach can be implemented so that the route

selection process optimizes the overall energy consumption of a path.

Such a process takes into account the current load, avoids waking

nodes/cards for handling "sparse" traffic (i.e., a minor portion of

the total traffic), considers node-specific data (e.g., [RFC7460]),

etc. This off-line Semantic Routing approach will transition

specific cards/nodes to "idle" and wake them as appropriate, etc.,

without breaking service objectives. Moreover, such an approach will

have to maintain an up-to-date topology even if a node is in an

"idle" state (such nodes may be removed from adjacency tables if they

don’t participate in routing advertisements).
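
As a rough illustration of the controller-based route selection
described above, the following Python sketch runs a shortest-path
search over energy cost and charges an extra penalty for entering a
node that is currently "idle", so that low-volume traffic is steered
around sleeping nodes rather than waking them. The topology, the cost
figures, the WAKE_PENALTY value and all function names are assumptions
made for this example only; none of them are taken from
[I-D.boucadair-irtf-sdn-and-semantic-routing] or [RFC7460]. Idle nodes
stay in the controller's topology so that they can still be selected
when the penalty is worth paying.

```python
import heapq

# Hypothetical topology: per-link energy cost (arbitrary units).
LINKS = {
    ("A", "B"): 2.0, ("B", "D"): 2.0,   # short path through idle node B
    ("A", "C"): 3.0, ("C", "D"): 3.0,   # longer path through active node C
}
# Idle nodes are kept in the topology even though they are asleep.
NODE_STATE = {"A": "active", "B": "idle", "C": "active", "D": "active"}
WAKE_PENALTY = 5.0  # assumed extra energy for waking an idle node


def neighbors(node):
    """Yield (neighbor, link_cost) pairs; links are treated as bidirectional."""
    for (u, v), cost in LINKS.items():
        if u == node:
            yield v, cost
        elif v == node:
            yield u, cost


def lowest_energy_path(src, dst, wake_penalty=WAKE_PENALTY):
    """Dijkstra over energy cost; entering an idle node adds wake_penalty."""
    queue = [(0.0, src, [src])]
    best = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in best and best[node] <= cost:
            continue  # already reached this node at equal or lower cost
        best[node] = cost
        for nxt, link_cost in neighbors(node):
            extra = wake_penalty if NODE_STATE[nxt] == "idle" else 0.0
            heapq.heappush(queue, (cost + link_cost + extra, nxt, path + [nxt]))
    return float("inf"), []
```

With the values above, lowest_energy_path("A", "D") prefers the path
through the active node C, whereas calling it with wake_penalty=0
makes the shorter path through B the cheaper one.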

8.3. Misc

The non-adopted, expired 2013 draft

[I-D.okamoto-ccamp-midori-gmpls-extension-reqs] discusses power

awareness in routing in conjunction with Traffic Engineering

(tunnels), specifically in the context of Generalized MPLS (GMPLS),

e.g., various L2 technologies such as switched optical fiber

networks. It primarily raises the issue that the existing management

objects are not sufficient to express energy-management-related

aspects and thus do not allow building energy-conscious policies into

a PCE for such GMPLS networks.

The non-adopted 2013 "Requirements for an Energy-Efficient Network

System", [I-D.suzuki-eens-requirements], proposes signaling of

network capacity towards the data center (DC), for example based on

load or network energy management, in support of appropriate

performance control (such as VM migration) in the DC, or vice versa

(DC load-based traffic engineering in the network to support that DC

load).

The non-adopted 2013 "Building power optimal Multicast Trees"

[I-D.mjsraman-rtgwg-pim-power] proposes that (PIM based) IP Multicast

routing could perform local routing choices in the case of "Equal

Cost MultiPath" (ECMP) "Reverse Path Forwarding" (RPF) alternatives

based on the energy that would be consumed in the router, such as

when one ECMP alternative would use a more power efficient linecard

or when one ECMP choice was on the same linecard as the interfaces to

which the packets would need to be routed (thereby avoiding

forwarding the packet across separate ingress and egress linecards).
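
The local choice among ECMP RPF alternatives can be sketched as
follows. This is only an illustration of the idea, not the procedure
specified in [I-D.mjsraman-rtgwg-pim-power]: the RpfCandidate type,
the watts_per_gbps figure, the FABRIC_CROSSING_COST constant and the
cost formula are all hypothetical. Each candidate upstream interface
is charged its linecard's efficiency figure plus an assumed cost for
every egress linecard that differs from the ingress linecard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RpfCandidate:
    interface: str         # candidate upstream (RPF) interface
    linecard: str          # linecard hosting that interface
    watts_per_gbps: float  # hypothetical per-linecard efficiency figure


FABRIC_CROSSING_COST = 1.5  # assumed cost when ingress and egress linecards differ


def energy_cost(candidate, egress_linecards):
    """Linecard cost plus one crossing cost per egress on another linecard."""
    crossings = sum(1 for lc in egress_linecards if lc != candidate.linecard)
    return candidate.watts_per_gbps + FABRIC_CROSSING_COST * crossings


def select_rpf(candidates, egress_linecards):
    """Pick the cheapest ECMP alternative; interface name breaks ties."""
    return min(
        candidates,
        key=lambda c: (energy_cost(c, egress_linecards), c.interface),
    )


CANDIDATES = [
    RpfCandidate("eth1/0", "LC1", 4.0),
    RpfCandidate("eth2/0", "LC2", 3.5),
]
```

In this example, select_rpf(CANDIDATES, ["LC1"]) picks eth1/0 because
it shares a linecard with the egress interface, while
select_rpf(CANDIDATES, ["LC3"]) picks eth2/0, the more efficient
linecard, since both alternatives would have to cross linecards.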

9. Gaps

The 2013 "Towards an Energy-Efficient Internet"

[I-D.winter-energy-efficient-internet] summarizes some of the same

work items as this document (as written back in 2013) and lists

additional non-adopted drafts. It also identifies three gap areas

that it suggests the IETF work on: "Load-adaptive

Resource Management", "Energy-efficient Protocol Design" and "Energy-

efficiency Metrics and Standard Benchmarking Methodologies".

Some aspects of those gap areas were partially tackled by later

work in the IETF, but broadly speaking, most of those areas remain

open to a wide range of possible further IETF/IRTF work.

10. Summary

11. Changelog

[RFC-Editor: this section to be removed in final document.]

The master for this document is hosted at

http://github.com/toerless/energy. Please submit issues and/or pull

requests for proposed changes, or join the team of authors and edit

the document yourself.

00: Initial version

01: Added Co-author (Mohamed Boucadair) - long list of typo fixes,

editorial improvements in abstract, introduction and other chapters.

Added section on satellite networks, devices with batteries, power

benchmarking and SDN-based forwarding semantics.

02: Minor text edits (med), added pointer to additional draft (med),

added co-author Pascal (tte).

03: Added Jeff Tantsura as co-author.

[BOUNDED_LATENCY]

Cruz, R.L., "A calculus for network delay. I. Network

elements in isolation", DOI 10.1109/18.61109,

IEEE Transactions on Information Theory (Volume: 37,

Issue: 1), 1991,

[I-D.ajunior-energy-awareness-00]

Junior, A. and R. C. Sofia, "Energy-awareness metrics

global applicability guidelines", Work in Progress,

Internet-Draft, draft-ajunior-energy-awareness-00, 16

October 2012, <https://www.ietf.org/archive/id/draft-

ajunior-energy-awareness-00.txt>.

[I-D.bormann-core-roadmap-05]

Bormann, C., "CoRE Roadmap and Implementation Guide", Work

in Progress, Internet-Draft, draft-bormann-core-roadmap-

05, 21 October 2013, <https://www.ietf.org/archive/id/

draft-bormann-core-roadmap-05.txt>.

[I-D.boucadair-irtf-sdn-and-semantic-routing]

Boucadair, M., Trossen, D., and A. Farrel, "Considerations

for the use of SDN in Semantic Routing Networks", Work in

Progress, Internet-Draft, draft-boucadair-irtf-sdn-and-

semantic-routing-01, 31 May 2022,

<https://www.ietf.org/archive/id/draft-boucadair-irtf-sdn-

and-semantic-routing-01.txt>.

[I-D.castellani-core-alive]

Castellani, A. P. and S. Loreto, "CoAP Alive Message",

Work in Progress, Internet-Draft, draft-castellani-core-

alive-00, 29 March 2012, <https://www.ietf.org/archive/id/

draft-castellani-core-alive-00.txt>.

[I-D.chakrabarti-nordmark-energy-aware-nd]

Chakrabarti, S., Nordmark, E., and M. Wasserman, "Energy

Aware IPv6 Neighbor Discovery Optimizations", Work in

Progress, Internet-Draft, draft-chakrabarti-nordmark-

energy-aware-nd-02, 12 March 2012,

<https://www.ietf.org/archive/id/draft-chakrabarti-

nordmark-energy-aware-nd-02.txt>.

[I-D.claise-power-management-arch]

Claise, B., Parello, J., and B. Schoening, "Power

Management Architecture", Work in Progress, Internet-

Draft, draft-claise-power-management-arch-02, 22 October

2010, <https://www.ietf.org/archive/id/draft-claise-power-

management-arch-02.txt>.

[I-D.desmouceaux-ipv6-mcast-wifi-power-usage]

Desmouceaux, Y., "Power consumption due to IPv6 multicast

on WiFi devices", Work in Progress, Internet-Draft, draft-

desmouceaux-ipv6-mcast-wifi-power-usage-01, 1 August 2014,

<https://www.ietf.org/archive/id/draft-desmouceaux-ipv6-

mcast-wifi-power-usage-01.txt>.

[I-D.fossati-core-monitor-option]

Fossati, T., Giacomin, P., and S. Loreto, "Monitor Option

for CoAP", Work in Progress, Internet-Draft, draft-

fossati-core-monitor-option-00, 9 July 2012,

<https://www.ietf.org/archive/id/draft-fossati-core-

monitor-option-00.txt>.

[I-D.fossati-core-publish-option]

Fossati, T., Giacomin, P., and S. Loreto, "Publish Option

for CoAP", Work in Progress, Internet-Draft, draft-

fossati-core-publish-option-03, 6 January 2014,

<https://www.ietf.org/archive/id/draft-fossati-core-

publish-option-03.txt>.

[I-D.giacomin-core-sleepy-option]

Fossati, T., Giacomin, P., Loreto, S., and M. Rossini,

"Sleepy Option for CoAP", Work in Progress, Internet-

Draft, draft-giacomin-core-sleepy-option-00, 29 February

2012, <https://www.ietf.org/archive/id/draft-giacomin-

core-sleepy-option-00.txt>.

[I-D.ietf-core-coap-pubsub]

Koster, M., Keranen, A., and J. Jimenez, "Publish-

Subscribe Broker for the Constrained Application Protocol

(CoAP)", Work in Progress, Internet-Draft, draft-ietf-

core-coap-pubsub-10, 4 May 2022,

<https://www.ietf.org/archive/id/draft-ietf-core-coap-

pubsub-10.txt>.

[I-D.ietf-dnssd-srp]

Lemon, T. and S. Cheshire, "Service Registration Protocol

for DNS-Based Service Discovery", Work in Progress,

Internet-Draft, draft-ietf-dnssd-srp-14, 11 July 2022,

<https://www.ietf.org/archive/id/draft-ietf-dnssd-srp-

14.txt>.

[I-D.ietf-tcpm-rfc793bis]

Eddy, W. M., "Transmission Control Protocol (TCP)

Specification", Work in Progress, Internet-Draft, draft-

ietf-tcpm-rfc793bis-28, 7 March 2022,

<https://www.ietf.org/archive/id/draft-ietf-tcpm-

rfc793bis-28.txt>.

[I-D.jennings-energy-pricing]

Jennings, C. and B. Nordman, "Communication of Energy

Price Information", Work in Progress, Internet-Draft,

draft-jennings-energy-pricing-01, 10 July 2011,

<https://www.ietf.org/archive/id/draft-jennings-energy-

pricing-01.txt>.

[I-D.lhan-problems-requirements-satellite-net]

Han, L., Li, R., Retana, A., Chen, M., Su, L., Jiang, T.,

and N. Wang, "Problems and Requirements of Satellite

Constellation for Internet", Work in Progress, Internet-

Draft, draft-lhan-problems-requirements-satellite-net-03,

6 July 2022, <https://www.ietf.org/archive/id/draft-lhan-

problems-requirements-satellite-net-03.txt>.

[I-D.manral-bmwg-power-usage]

Manral, V., Sharma, P., Banerjee, S., and Y. Ping,

"Benchmarking Power usage of networking devices", Work in

Progress, Internet-Draft, draft-manral-bmwg-power-usage-

04, 12 March 2013, <https://www.ietf.org/archive/id/draft-

manral-bmwg-power-usage-04.txt>.

[I-D.mjsraman-panet-inter-as-power-source]

Raman, S., Venkataswami, B. V., Raina, G., and K.

Veezhinathan, "Reducing Power Consumption using BGP with

power source data", Work in Progress, Internet-Draft,

draft-mjsraman-panet-inter-as-power-source-00, 25 January

2013, <https://www.ietf.org/archive/id/draft-mjsraman-

panet-inter-as-power-source-00.txt>.

[I-D.mjsraman-rtgwg-pim-power]

Raman, S., Venkataswami, B. V., Raina, G., and V. Srini,

"Building power optimal Multicast Trees", Work in

Progress, Internet-Draft, draft-mjsraman-rtgwg-pim-power-

02, 27 March 2012, <https://www.ietf.org/archive/id/draft-

mjsraman-rtgwg-pim-power-02.txt>.

[I-D.okamoto-ccamp-midori-gmpls-extension-reqs]

Okamoto, S., "Requirements of GMPLS Extensions for Energy

Efficient Traffic Engineering", Work in Progress,

Internet-Draft, draft-okamoto-ccamp-midori-gmpls-

extension-reqs-02, 14 March 2013,

<https://www.ietf.org/archive/id/draft-okamoto-ccamp-

midori-gmpls-extension-reqs-02.txt>.

[I-D.petrescu-v6ops-ipv6-power-ipv4]

Petrescu, A., Said, S. B. H., Philippot, O., and T.

Vincent, "Power Consumption of IPv6 vs IPv4 in

Smartphone", Work in Progress, Internet-Draft, draft-

petrescu-v6ops-ipv6-power-ipv4-00, 13 March 2017,

<https://www.ietf.org/archive/id/draft-petrescu-v6ops-

ipv6-power-ipv4-00.txt>.

[I-D.rahman-core-sleepy]

Rahman, A., "Enhanced Sleepy Node Support for CoAP", Work

in Progress, Internet-Draft, draft-rahman-core-sleepy-05,

11 February 2014, <https://www.ietf.org/archive/id/draft-

rahman-core-sleepy-05.txt>.

[I-D.rahman-core-sleepy-nodes-do-we-need]

Rahman, A., "Sleepy Devices: Do we need to Support them in

CORE?", Work in Progress, Internet-Draft, draft-rahman-

core-sleepy-nodes-do-we-need-01, 11 February 2014,

<https://www.ietf.org/archive/id/draft-rahman-core-sleepy-

nodes-do-we-need-01.txt>.

[I-D.rahman-core-sleepy-problem-statement]

Rahman, A., Fossati, T., Loreto, S., and M. Vial, "Sleepy

Devices in CoAP - Problem Statement", Work in Progress,

Internet-Draft, draft-rahman-core-sleepy-problem-

statement-01, 21 October 2012,

<https://www.ietf.org/archive/id/draft-rahman-core-sleepy-

problem-statement-01.txt>.

[I-D.retana-rtgwg-eacp]

Retana, A., White, R., and M. Paul, "A Framework and

Requirements for Energy Aware Control Planes", Work in

Progress, Internet-Draft, draft-retana-rtgwg-eacp-03, 24

October 2014, <https://www.ietf.org/archive/id/draft-

retana-rtgwg-eacp-03.txt>.

[I-D.suzuki-eens-requirements]

Suzuki, T. and T. Tarui, "Requirements for an Energy-

Efficient Network System", Work in Progress, Internet-

Draft, draft-suzuki-eens-requirements-00, 15 October 2012,

<https://www.ietf.org/archive/id/draft-suzuki-eens-

requirements-00.txt>.

[I-D.vial-core-mirror-proxy]

Vial, M., "CoRE Mirror Server", Work in Progress,

Internet-Draft, draft-vial-core-mirror-proxy-01, 13 July

2012, <https://www.ietf.org/archive/id/draft-vial-core-

mirror-proxy-01.txt>.

[I-D.vial-core-mirror-server]

Vial, M., "CoRE Mirror Server", Work in Progress,

Internet-Draft, draft-vial-core-mirror-server-01, 10 April

2013, <https://www.ietf.org/archive/id/draft-vial-core-

mirror-server-01.txt>.

[I-D.wang-roll-energy-optimization-scheme]

Wang, H., Wei, M., Li, S., Huang, Q., Wang, P., and C.

Wang, "An energy optimization routing scheme for LLSs",

Work in Progress, Internet-Draft, draft-wang-roll-energy-

optimization-scheme-00, 21 February 2017,

<https://www.ietf.org/archive/id/draft-wang-roll-energy-

optimization-scheme-00.txt>.

[I-D.winter-energy-efficient-internet]

Winter, R., Jeong, S., and J. Choi, "Towards an Energy-

Efficient Internet", Work in Progress, Internet-Draft,

draft-winter-energy-efficient-internet-01, 22 October

2012, <https://www.ietf.org/archive/id/draft-winter-

energy-efficient-internet-01.txt>.

[I-D.zhang-greennet]

Zhang, B., Shi, J., Dong, J., and M. Zhang, "Power-aware

Routing and Traffic Engineering: Requirements, Approaches,

and Issues", Work in Progress, Internet-Draft, draft-

zhang-greennet-01, 10 January 2013,

<https://www.ietf.org/archive/id/draft-zhang-greennet-

01.txt>.

[I-D.zhang-panet-problem-statement]

Zhang, B., Shi, J., Dong, J., Zhang, M., and M. Boucadair,

"Power-Aware Networks (PANET): Problem Statement", Work in

Progress, Internet-Draft, draft-zhang-panet-problem-

statement-03, 15 October 2013,

<https://www.ietf.org/archive/id/draft-zhang-panet-

problem-statement-03.txt>.

[NASPICLOCK]

NASPI Time Synchronization Task Force, "Time Synchronization in the

Electric Power System", March 2017,

<https://www.naspi.org/sites/default/files/

reference_documents/tstf_electric_power_system_report_pnnl

_26331_march_2017_0.pdf>.

[RFC1112] Deering, S., "Host extensions for IP multicasting", STD 5,

RFC 1112, DOI 10.17487/RFC1112, August 1989,

[RFC1142] Oran, D., Ed., "OSI IS-IS Intra-domain Routing Protocol",

RFC 1142, DOI 10.17487/RFC1142, February 1990,

[RFC1628] Case, J., Ed., "UPS Management Information Base",

RFC 1628, DOI 10.17487/RFC1628, May 1994,

[RFC1866] Berners-Lee, T. and D. Connolly, "Hypertext Markup

Language - 2.0", RFC 1866, DOI 10.17487/RFC1866, November

1995, <https://www.rfc-editor.org/info/rfc1866>.

[RFC1883] Deering, S. and R. Hinden, "Internet Protocol, Version 6

(IPv6) Specification", RFC 1883, DOI 10.17487/RFC1883,

December 1995, <https://www.rfc-editor.org/info/rfc1883>.

[RFC2086] Myers, J., "IMAP4 ACL extension", RFC 2086,

DOI 10.17487/RFC2086, January 1997,

[RFC2212] Shenker, S., Partridge, C., and R. Guerin, "Specification

of Guaranteed Quality of Service", RFC 2212,

DOI 10.17487/RFC2212, September 1997,

[RFC2246] Dierks, T. and C. Allen, "The TLS Protocol Version 1.0",

RFC 2246, DOI 10.17487/RFC2246, January 1999,

[RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328,

DOI 10.17487/RFC2328, April 1998,

[RFC2475] Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z.,

and W. Weiss, "An Architecture for Differentiated

Services", RFC 2475, DOI 10.17487/RFC2475, December 1998,

[RFC2543] Handley, M., Schulzrinne, H., Schooler, E., and J.

Rosenberg, "SIP: Session Initiation Protocol", RFC 2543,

DOI 10.17487/RFC2543, March 1999,

[RFC3433] Bierman, A., Romascanu, D., and K.C. Norseth, "Entity

Sensor Management Information Base", RFC 3433,

DOI 10.17487/RFC3433, December 2002,

[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.

Jacobson, "RTP: A Transport Protocol for Real-Time

Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,

[RFC3621] Berger, A. and D. Romascanu, "Power Ethernet MIB",

RFC 3621, DOI 10.17487/RFC3621, December 2003,

[RFC3948] Huttunen, A., Swander, B., Volpe, V., DiBurro, L., and M.

Stenberg, "UDP Encapsulation of IPsec ESP Packets",

RFC 3948, DOI 10.17487/RFC3948, January 2005,

[RFC3977] Feather, C., "Network News Transfer Protocol (NNTP)",

RFC 3977, DOI 10.17487/RFC3977, October 2006,

[RFC4268] Chisholm, S. and D. Perkins, "Entity State MIB", RFC 4268,

DOI 10.17487/RFC4268, November 2005,

[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A

Border Gateway Protocol 4 (BGP-4)", RFC 4271,

DOI 10.17487/RFC4271, January 2006,

[RFC4607] Holbrook, H. and B. Cain, "Source-Specific Multicast for

IP", RFC 4607, DOI 10.17487/RFC4607, August 2006,

[RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman,

"Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,

[RFC4944] Montenegro, G., Kushalnagar, N., Hui, J., and D. Culler,

"Transmission of IPv6 Packets over IEEE 802.15.4

Networks", RFC 4944, DOI 10.17487/RFC4944, September 2007,

[RFC4949] Shirey, R., "Internet Security Glossary, Version 2",

FYI 36, RFC 4949, DOI 10.17487/RFC4949, August 2007,

[RFC5548] Dohler, M., Ed., Watteyne, T., Ed., Winter, T., Ed., and

D. Barthel, Ed., "Routing Requirements for Urban Low-Power

and Lossy Networks", RFC 5548, DOI 10.17487/RFC5548, May

[RFC5673] Pister, K., Ed., Thubert, P., Ed., Dwars, S., and T.

Phinney, "Industrial Routing Requirements in Low-Power and

Lossy Networks", RFC 5673, DOI 10.17487/RFC5673, October

[RFC5826] Brandt, A., Buron, J., and G. Porcu, "Home Automation

Routing Requirements in Low-Power and Lossy Networks",

RFC 5826, DOI 10.17487/RFC5826, April 2010,

[RFC5867] Martocci, J., Ed., De Mil, P., Riou, N., and W. Vermeylen,

"Building Automation Routing Requirements in Low-Power and

Lossy Networks", RFC 5867, DOI 10.17487/RFC5867, June

[RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,

"Network Time Protocol Version 4: Protocol and Algorithms

Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010,

[RFC6206] Levis, P., Clausen, T., Hui, J., Gnawali, O., and J. Ko,

"The Trickle Algorithm", RFC 6206, DOI 10.17487/RFC6206,

March 2011, <https://www.rfc-editor.org/info/rfc6206>.

[RFC6272] Baker, F. and D. Meyer, "Internet Protocols for the Smart

Grid", RFC 6272, DOI 10.17487/RFC6272, June 2011,

[RFC6282] Hui, J., Ed. and P. Thubert, "Compression Format for IPv6

Datagrams over IEEE 802.15.4-Based Networks", RFC 6282,

[RFC6550] Winter, T., Ed., Thubert, P., Ed., Brandt, A., Hui, J.,

Kelsey, R., Levis, P., Pister, K., Struik, R., Vasseur,

JP., and R. Alexander, "RPL: IPv6 Routing Protocol for

Low-Power and Lossy Networks", RFC 6550,

DOI 10.17487/RFC6550, March 2012,

[RFC6690] Shelby, Z., "Constrained RESTful Environments (CoRE) Link

Format", RFC 6690, DOI 10.17487/RFC6690, August 2012,

[RFC6763] Cheshire, S. and M. Krochmal, "DNS-Based Service

Discovery", RFC 6763, DOI 10.17487/RFC6763, February 2013,

[RFC6775] Shelby, Z., Ed., Chakrabarti, S., Nordmark, E., and C.

Bormann, "Neighbor Discovery Optimization for IPv6 over

Low-Power Wireless Personal Area Networks (6LoWPANs)",

RFC 6775, DOI 10.17487/RFC6775, November 2012,

[RFC6887] Wing, D., Ed., Cheshire, S., Boucadair, M., Penno, R., and

P. Selkirk, "Port Control Protocol (PCP)", RFC 6887,

DOI 10.17487/RFC6887, April 2013,

[RFC6933] Bierman, A., Romascanu, D., Quittek, J., and M.

Chandramouli, "Entity MIB (Version 4)", RFC 6933,

DOI 10.17487/RFC6933, May 2013,

[RFC6988] Quittek, J., Ed., Chandramouli, M., Winter, R., Dietz, T.,

and B. Claise, "Requirements for Energy Management",

RFC 6988, DOI 10.17487/RFC6988, September 2013,

[RFC7030] Pritikin, M., Ed., Yee, P., Ed., and D. Harkins, Ed.,

"Enrollment over Secure Transport", RFC 7030,

DOI 10.17487/RFC7030, October 2013,

[RFC7102] Vasseur, JP., "Terms Used in Routing for Low-Power and

Lossy Networks", RFC 7102, DOI 10.17487/RFC7102, January

[RFC7326] Parello, J., Claise, B., Schoening, B., and J. Quittek,

"Energy Management Framework", RFC 7326,

[RFC7460] Chandramouli, M., Claise, B., Schoening, B., Quittek, J.,

and T. Dietz, "Monitoring and Control MIB for Power and

Energy", RFC 7460, DOI 10.17487/RFC7460, March 2015,

[RFC7461] Parello, J., Claise, B., and M. Chandramouli, "Energy

Object Context MIB", RFC 7461, DOI 10.17487/RFC7461, March

[RFC7577] Quittek, J., Winter, R., and T. Dietz, "Definition of

Managed Objects for Battery Monitoring", RFC 7577,

DOI 10.17487/RFC7577, July 2015,

[RFC7603] Schoening, B., Chandramouli, M., and B. Nordman, "Energy

Management (EMAN) Applicability Statement", RFC 7603,

DOI 10.17487/RFC7603, August 2015,

[RFC7641] Hartke, K., "Observing Resources in the Constrained

Application Protocol (CoAP)", RFC 7641,

[RFC768] Postel, J., "User Datagram Protocol", STD 6, RFC 768,

DOI 10.17487/RFC0768, August 1980,

[RFC7733] Brandt, A., Baccelli, E., Cragie, R., and P. van der Stok,

"Applicability Statement: The Use of the Routing Protocol

for Low-Power and Lossy Networks (RPL) Protocol Suite in

Home Automation and Building Control", RFC 7733,

DOI 10.17487/RFC7733, February 2016,

[RFC7772] Yourtchenko, A. and L. Colitti, "Reducing Energy

Consumption of Router Advertisements", BCP 202, RFC 7772,

DOI 10.17487/RFC7772, February 2016,

[RFC7849] Binet, D., Boucadair, M., Vizdal, A., Chen, G., Heatley,

N., Chandler, R., Michaud, D., Lopez, D., and W. Haeffner,

"An IPv6 Profile for 3GPP Mobile Devices", RFC 7849,

DOI 10.17487/RFC7849, May 2016,

[RFC791] Postel, J., "Internet Protocol", STD 5, RFC 791,

[RFC793] Postel, J., "Transmission Control Protocol", STD 7,

RFC 793, DOI 10.17487/RFC0793, September 1981,

[RFC8036] Cam-Winget, N., Ed., Hui, J., and D. Popa, "Applicability

Statement for the Routing Protocol for Low-Power and Lossy

Networks (RPL) in Advanced Metering Infrastructure (AMI)

Networks", RFC 8036, DOI 10.17487/RFC8036, January 2017,

[RFC822] Crocker, D., "STANDARD FOR THE FORMAT OF ARPA INTERNET

TEXT MESSAGES", STD 11, RFC 822, DOI 10.17487/RFC0822,

August 1982, <https://www.rfc-editor.org/info/rfc822>.

[RFC8352] Gomez, C., Kovatsch, M., Tian, H., and Z. Cao, Ed.,

"Energy-Efficient Features of Internet of Things

Protocols", RFC 8352, DOI 10.17487/RFC8352, April 2018,

[RFC8368] Eckert, T., Ed. and M. Behringer, "Using an Autonomic

Control Plane for Stable Connectivity of Network

Operations, Administration, and Maintenance (OAM)",

RFC 8368, DOI 10.17487/RFC8368, May 2018,

[RFC8428] Jennings, C., Shelby, Z., Arkko, J., Keranen, A., and C.

Bormann, "Sensor Measurement Lists (SenML)", RFC 8428,

DOI 10.17487/RFC8428, August 2018,

[RFC8575] Jiang, Y., Ed., Liu, X., Xu, J., and R. Cummings, Ed.,

"YANG Data Model for the Precision Time Protocol (PTP)",

RFC 8575, DOI 10.17487/RFC8575, May 2019,

[RFC8613] Selander, G., Mattsson, J., Palombini, F., and L. Seitz,

"Object Security for Constrained RESTful Environments

(OSCORE)", RFC 8613, DOI 10.17487/RFC8613, July 2019,

[RFC8724] Minaburo, A., Toutain, L., Gomez, C., Barthel, D., and JC.

Zuniga, "SCHC: Generic Framework for Static Context Header

Compression and Fragmentation", RFC 8724,

DOI 10.17487/RFC8724, April 2020,

[RFC8815] Abrahamsson, M., Chown, T., Giuliano, L., and T. Eckert,

"Deprecating Any-Source Multicast (ASM) for Interdomain

Multicast", BCP 229, RFC 8815, DOI 10.17487/RFC8815,

August 2020, <https://www.rfc-editor.org/info/rfc8815>.

[RFC8824] Minaburo, A., Toutain, L., and R. Andreasen, "Static

Context Header Compression (SCHC) for the Constrained

Application Protocol (CoAP)", RFC 8824,

DOI 10.17487/RFC8824, June 2021,

[RFC8994] Eckert, T., Ed., Behringer, M., Ed., and S. Bjarnason, "An

Autonomic Control Plane (ACP)", RFC 8994,

DOI 10.17487/RFC8994, May 2021,

[RFC9008] Robles, M.I., Richardson, M., and P. Thubert, "Using RPI

Option Type, Routing Header for Source Routes, and IPv6-

in-IPv6 Encapsulation in the RPL Data Plane", RFC 9008,

DOI 10.17487/RFC9008, April 2021,

[RFC9010] Thubert, P., Ed. and M. Richardson, "Routing for RPL

(Routing Protocol for Low-Power and Lossy Networks)

Leaves", RFC 9010, DOI 10.17487/RFC9010, April 2021,

[RFC9011] Gimenez, O., Ed. and I. Petrov, Ed., "Static Context

Header Compression and Fragmentation (SCHC) over LoRaWAN",

RFC 9011, DOI 10.17487/RFC9011, April 2021,

[RFC9119] Perkins, C., McBride, M., Stanley, D., Kumari, W., and JC.

Zúñiga, "Multicast Considerations over IEEE 802 Wireless

Media", RFC 9119, DOI 10.17487/RFC9119, October 2021,

[RFC9148] van der Stok, P., Kampanakis, P., Richardson, M., and S.

Raza, "EST-coaps: Enrollment over Secure Transport with

the Secure Constrained Application Protocol", RFC 9148,

DOI 10.17487/RFC9148, April 2022,

[RFC9176] Amsüss, C., Ed., Shelby, Z., Koster, M., Bormann, C., and

P. van der Stok, "Constrained RESTful Environments (CoRE)

Resource Directory", RFC 9176, DOI 10.17487/RFC9176, April

[RFC9178] Arkko, J., Eriksson, A., and A. Keränen, "Building Power-

Efficient Constrained Application Protocol (CoAP) Devices

for Cellular Networks", RFC 9178, DOI 10.17487/RFC9178,

May 2022, <https://www.rfc-editor.org/info/rfc9178>.

[VC2014] Ong, D., Moors, T., and V. Sivaraman, "Comparison of the

energy, carbon and time costs of videoconferencing and in-

person meetings", DOI 10.1016/j.comcom.2014.02.009, 2014,

<https://www.sciencedirect.com/science/article/pii/

S0140366414000620>.

Toerless Eckert (editor)

Futurewei Technologies USA

Santa Clara, CA 95050

United States of America

Email: tte@cs.fau.de

Mohamed Boucadair

Orange

35000 Rennes

France

Email: mohamed.boucadair@orange.com

Pascal Thubert

Cisco Systems, Inc.

45 Allee des Ormes - BP1200, Building D

06254 MOUGINS Sophia Antipolis

France

Phone: +33 497 23 26 34

Email: pthubert@cisco.com

Jeff Tantsura

Microsoft

Internet Research Task Force J. François

Internet-Draft Inria

Intended status: Informational A. Clemm

Expires: 12 January 2023 Futurewei Technologies, Inc.

D. Papadimitriou

S. Fernandes

Central Bank of Canada

S. Schneider

Digital Railway (DSD) at Deutsche Bahn

11 July 2022

Research Challenges in Coupling Artificial Intelligence and Network

Management

draft-francois-nmrg-ai-challenges-00

Abstract

This document introduces the challenges to overcome when network

management problems need to be coupled with AI

solutions. On one hand, there are many difficult problems in Network

Management that to this date have no good solutions, or where any

solutions come with significant limitations and constraints.

Artificial Intelligence may help produce novel solutions to those

problems. On the other hand, for several reasons (computational

costs of AI solutions, privacy of data), distribution of AI tasks

has become essential. It is thus also expected that networks should

be operated efficiently to support those tasks.

To identify the right set of challenges, the document defines a

method based on the evolution and nature of NM problems. This will

be done in parallel with advances and the nature of existing

solutions in AI in order to highlight where AI and NM have already

been coupled or could benefit from closer integration.

So, the method aims at evaluating the gap between NM problems and AI

solutions. Challenges are derived accordingly, assuming solving

these challenges will help to reduce the gap between NM and AI.

Status of This Memo

François, et al. Expires 12 January 2023 [Page 1]

Internet-Draft Coupling AI and network management July 2022

Copyright Notice

Provisions Relating to IETF Documents (https://trustee.ietf.org/

license-info) in effect on the date of publication of this document.

Please review these documents carefully, as they describe your rights

and restrictions with respect to this document.

Table of Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3

2. Conventions and Definitions . . . . . . . . . . . . . . . . . 5

3. Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . 5

4. Difficult problems in network management . . . . . . . . . . 5

5. High-level challenges in adopting AI in NM . . . . . . . . . 8

6. AI techniques for network management . . . . . . . . . . . . 10

6.1. Problem type and mapping . . . . . . . . . . . . . . . . 10

6.1.1. Sub-challenge: Suitable Approach for Given Input . . 10

6.1.2. Sub-challenge: Suitable Approach for Desired

Output . . . . . . . . . . . . . . . . . . . . . . . 11

6.1.3. Sub-challenge: Tailoring the AI Approach to the Given

Problem . . . . . . . . . . . . . . . . . . . . . . . 12

6.2. Performance of produced models . . . . . . . . . . . . . 13

6.3. Lightweight AI . . . . . . . . . . . . . . . . . . . . . 15

6.4. AI for planning of actions . . . . . . . . . . . . . . . 16

7. Network data as input for ML algorithms . . . . . . . . . . . 18

7.1. Data for AI-based NM solutions . . . . . . . . . . . . . 19

7.2. Data collection . . . . . . . . . . . . . . . . . . . . . 20

7.3. Usable data . . . . . . . . . . . . . . . . . . . . . . . 21

8. Acceptability of AI . . . . . . . . . . . . . . . . . . . . 22

8.1. Explainability of Network-AI products . . . . . . . . . 23

8.2. AI-based products and algorithms in production

systems . . . . . . . . . . . . . . . . . . . . . . . . . 24

8.3. AI with humans in the loop . . . . . . . . . . . . . . . 25

11. References . . . . . . . . . . . . . . . . . . . . . . . . . 26

11.1. Normative References . . . . . . . . . . . . . . . . . . 26

11.2. Informative References . . . . . . . . . . . . . . . . . 26

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 32

1. Introduction

The functional scope of network management (NM) is very large,

ranging from monitoring to accounting, from network provisioning to

service diagnostics, from usage accounting to security. The taxonomy

defined in [Hoo18] extends the traditional Fault, Configuration,

Accounting, Performance, Security (FCAPS) domains by considering

additional functional areas but above all by promoting additional

views. For instance, network management approaches can be classified

according to the technologies, methods or paradigms they will rely

on. Methods include common approaches such as mathematical

optimization or queuing theory, but also techniques that have been

widely applied in recent decades, such as game theory, data analysis,

data mining and machine learning. In management paradigms, autonomic

and cognitive management are listed. As highlighted by this taxonomy,

the definition of automated and more intelligent techniques has been

promoted to support efficient network management operations.

Research in NM and more generally in networking has been very active

in the area of applied ML [Bou18].

However, to keep the network operational within pre-defined safety

bounds, NM still heavily relies on established procedures. Even

after several cycles of adding automation, those procedures are still

mostly fixed in the sense that the exact control loop and all

possibilities are defined in advance. They are thus mostly

deterministic by nature, or at least come with maximal error bounds.

Obviously, there have been many proposals to make networks smarter or

more intelligent with the use of ML, but without large adoption for

running real networks because it shifts the paradigm towards

stochastic methods.

ML is the sub-area of AI that attracts most of the focus nowadays,

but AI encompasses other areas, including knowledge representation,

inference rule engines, statistical methods and, by extension, the

techniques that allow observing and performing actions on a system.

It is thus legitimate to question whether ML, or AI in general, could

be helpful for NM with regard to practical deployment. This question

is actually tied to the problems that NM aims to address.

Independently of NM, ML solutions were introduced to solve, in an

approximate way, a type of problem that is very complex in nature,

i.e., one for which no algorithm is known that finds an optimal

solution in polynomial time. This is the case for NP-hard problems.

In those cases, solutions typically rely on heuristics that may not

yield optimal results, or on algorithms that run into issues with

scalability and the ability to produce timely results due to the

exponential search space. Such problems exist in NM: the allocation

of resources for service function chaining or network slicing, among

others, is a recent example that has gained interest in our community

with SDN. Many proposals consist of defining the problem as a

Mixed-Integer Linear Program (MILP) with some heuristics to reach a

satisfactory tradeoff between solution quality, computation time and

model size/dimensionality. ML is recognized to be well suited to

making progress on this type of problem [Kaf19].
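
The tradeoff between heuristics and exact search can be made concrete
with a deliberately simplified sketch. It assumes that resource
allocation is reduced to packing function instances with given CPU
demands onto as few identical nodes as possible (a bin-packing
problem, which is NP-hard); real service function chaining or slicing
formulations carry many more constraints. The function names and the
sample demands are hypothetical. The greedy heuristic runs quickly but
may open more nodes than necessary, while the exhaustive search is
exact but enumerates a number of assignments that grows exponentially
with the input size.

```python
from itertools import product


def first_fit_decreasing(demands, capacity):
    """Greedy heuristic: largest demand first, into the first node that fits."""
    free_per_node = []  # remaining capacity of each node opened so far
    for demand in sorted(demands, reverse=True):
        for i, free in enumerate(free_per_node):
            if demand <= free:
                free_per_node[i] = free - demand
                break
        else:
            free_per_node.append(capacity - demand)  # open a new node
    return len(free_per_node)


def exhaustive_minimum(demands, capacity):
    """Exact search over all n**n assignments; only viable for tiny inputs."""
    n = len(demands)
    best = n  # one node per demand is always feasible if each demand fits
    for assignment in product(range(n), repeat=n):
        load = [0] * n
        for demand, node in zip(demands, assignment):
            load[node] += demand
        if all(total <= capacity for total in load):
            best = min(best, len(set(assignment)))
    return best


DEMANDS = [5, 4, 4, 3, 2, 2]  # hypothetical CPU demands
CAPACITY = 10                 # hypothetical per-node capacity
```

For these sample values the heuristic opens three nodes although two
suffice, which the exhaustive search finds only after examining
6**6 = 46656 assignments.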

However, all problems of NM are not NP-hard. Due to real-time

constraints, some involve very short control loops that require both

rapid decisions and the ability to rapidly adapt to new situations

and different contexts. So, even in that case, time is critical and

approximate solutions are usually more acceptable. Again, this is

where AI can be beneficial. Expert systems are in fact AI systems

[Ste92], but systems of this kind, which are built from numerous

inference rules, are not designed to scale with the volume and

heterogeneity of data that can be collected in a network today. In

contrast, ML is more efficient at automatically learning abstract

representations of the rules, which can be updated if needed.

Another common type of problem in NM is classification. For

instance, classifying network flows is helpful for security purposes

to detect attack flows, to differentiate QoS among the different

flows (e.g., real-time streams which need to be prioritized), etc.

ML-based classification algorithms have been widely used in the

literature, with high-quality results when properly applied, leading

to their use in commercial products. There are many algorithms,

including decision trees, support vector machines, and (deep) neural

networks, which have proven efficient in many areas, notably image

and natural language processing.

Finally, many problems also still rely on humans in the loop, from

support issues such as dealing with trouble tickets to planning

activities for the roll-out of new services. This creates

operational bottlenecks and is often expensive and error prone. Tasks

of this kind could be either automated or guided by an AI system to

avoid human bias. Indeed, the gap between available human resources

and the complexity of the problems to deal with is already large, and

it will continue to widen due to the size of networks, heterogeneity

of devices, services, etc. Hence, human-based procedures tend to be

either too simple for the problem to solve or time-consuming. Notable

examples are in security, where the network operator has to defend

against potentially unknown threats. As a result, services might be

severely affected for hours.

All the aforementioned problems are exacerbated by networks becoming

more complex to operate along many dimensions (users, devices,

services, connections, etc.). Therefore, AI is

expected to enable or simplify the solving of those problems in real

networks in the near future [czb20] [Yan20] because those would

require reaching unprecedented levels of performance in terms of

throughput, latency, mobility, security, etc.

2. Conventions and Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",

"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and

"OPTIONAL" in this document are to be interpreted as described in

BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all

capitals, as shown here.

3. Acronyms

AI: Artificial Intelligence

GAN: Generative Adversarial Network

GNN: Graph Neural Network

LSTM: Long Short-Term Memory

ML: Machine Learning

MLP: Multilayer Perceptron

NM: Network Management

4. Difficult problems in network management

As mentioned in the introduction, problems to be tackled in NM tend to be

complex and exhibit characteristics that make them candidates for

solutions that involve AI techniques:

* C1: A very large solution space, combinatorially exploding with

the size of the problem domain. This makes it impractical to

explore and test every solution (again, NP-hard problems).

* C2: Uncertainty and unpredictability along multiple dimensions,

including the context in which the solution is applied, behavior

of users and traffic, lack of visibility into network state, and

more. In addition, many networks do not exist in isolation but

are subjected to myriads of interdependencies, some outside their

control. Accordingly, there are many external parameters that

affect the efficiency of the solution to a problem and that cannot

be known in advance: user activity, interconnected networks, etc.

* C3: The need to provide answers (i.e. compute solutions, deliver

verdicts, make decisions) in constrained or deterministic time.

In many cases, context changes dynamically and decisions need to

be made quickly to be of use.

* C4: Data-dependent solutions. To solve a problem accurately, it

can be necessary to rely on large volumes of data, having to deal

with issues that range from data heterogeneity to incomplete data

to general challenges of dealing with high data velocity.

* C5: Need to be integrated with existing automatic and human

processes.

* C6: Solutions must be cost-effective as resources (bandwidth, CPU,

human, etc.) can be limited, notably when part of processing is

distributed at the network edge or within the network.

Many problems are affected by multiple criteria. Below is a non-

exhaustive list of complex NM problems for which AI and/or non-AI-

based approaches have been proposed:

* Computation of optimal paths: packet forwarding is not always

based on traditional routing protocols with least cost routing,

but on computation of paths that are optimized for certain

criteria - for example, to meet certain service level objectives, to

result in greater resilience, to balance utilization, to optimize

energy usage, etc. Many of those solutions can be found in SDN,

where a controller or path computation element computes paths that

are subsequently provisioned across the network. However, such

solutions generally do not scale to millions of paths (C1), and

cannot be recomputed in sub-second time scales (C3) to take into

account dynamically changing network conditions (C2). To compute

those paths, operations research techniques have been extensively

used in the literature along with AI methods as shown in [Lop20]. As

such, this problem can be considered as close to big data problems

with some of the different Vs: volume, velocity, variety, value...

* Classification of network traffic: without loss of generality a

common objective of network monitoring for operators is to know

the type of traffic going through their networks (web, streaming,

gaming, VoIP). By nature, this task analyzes data (C4) which can

vary over time (C2) except in very particular scenarios like

industrial isolated networks. However, the output of the

classification technique is time-constrained only in specific

cases where fast decisions must be made, for example to reroute

traffic. Simple identification based on IANA-assigned TCP/UDP

port numbers was sufficient in the past. However, with

applications using dynamic port numbers, signature techniques can

be used to match packet payload [Sen04]. To handle applications

now encapsulated in encrypted web or VPN traffic, machine learning

has been leveraged [Bri19].

* Network diagnostics: disruptions of networking services can have

many causes. Identifying the root cause can be of high importance

when what is causing the disruption is not properly understood, so

that repair actions can address the root cause versus just working

around the symptoms. Further complicating the matter are

scenarios in which disruptions are not "hard" but involve only a

degradation of service level, and where disruptions are

intermittent, not reproducible, and hard to predict. Artificial

intelligence techniques can offer promising solutions.

* Intent-Based Networking (IBN): Roughly speaking, IBN refers to the

ability to manage networks by articulating desired outcomes

without the need to specify a course of actions to achieve those

outcomes. The ability to determine such courses of actions, in

particular in scenarios with multiple interdependencies,

conflicting goals, large scale, and highly complex and dynamic

environments is a huge and largely unsolved challenge. Artificial

Intelligence techniques can be of help here in multiple ways, from

accurately classifying dynamic context to determine matching

actions to reframing the expression of intent as a game that can

be played (and won) using artificially intelligent techniques.

* VNF placement and SFC design: Virtual Network Functions need to be

placed on physical resources and Service Function Chains designed

in an optimized manner to limit the use of networking resources and

minimize energy usage.

* Smart admission control to avoid congestion and oversubscription

of network resources: Admission control needs to be set up and

performed in ways that ensure service levels are optimized in a

manner that is fair and aligned with application needs, with congestion

avoided or its effects mitigated.
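
As an illustration of the kind of heuristic that the VNF placement
item above calls for, the following minimal Python sketch places VNFs
on servers with a greedy first-fit-decreasing rule. It is not taken
from any cited work: the function name place_vnfs, the single CPU
dimension, and all numbers are assumptions made for illustration. A
realistic formulation would also account for links, latency, and
energy models.

```python
def place_vnfs(demands, capacities):
    """Greedy first-fit-decreasing placement (illustrative heuristic).

    demands:    dict mapping VNF name -> required CPU units (hypothetical unit)
    capacities: dict mapping server name -> available CPU units
    Returns a dict mapping VNF name -> server name.
    Raises ValueError if some VNF does not fit on any server.
    """
    free = dict(capacities)
    placement = {}
    # Place the most demanding VNFs first.
    for vnf, need in sorted(demands.items(), key=lambda kv: -kv[1]):
        for server in sorted(free):  # deterministic server order
            if free[server] >= need:
                free[server] -= need
                placement[vnf] = server
                break
        else:
            raise ValueError(f"no server can host {vnf}")
    return placement


# Made-up example: four VNFs, two servers with 8 CPU units each.
demands = {"firewall": 4, "nat": 2, "dpi": 6, "lb": 3}
capacities = {"s1": 8, "s2": 8}
placement = place_vnfs(demands, capacities)
```

Such a heuristic runs quickly but gives no optimality guarantee,
which is the tradeoff between solution quality and computation time
discussed in the introduction.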

5. High-level challenges in adopting AI in NM

As shown in the previous section, AI techniques are good candidates

for the difficult NM problems. Many proposals have been made, but

most of them remain at the level of prototypes or have only been

evaluated with simulation and/or emulation. It is thus fair to ask

why our community invests so much research in this direction but has

not adopted those solutions to operate real networks. There are

several obstacles.

First, AI advances have been historically driven by the image/video,

natural language and signal processing communities as well as

robotics for many decades. As a result, the most impressive

applications are in these areas, including the recent spread of home

assistants and the significant progress in autonomous vehicles.

Network experts, by contrast, have been focused on building the

Internet, especially on building protocols to make the world

interconnected, with ever better performance and services. This

trend continues today with 5G in deployment and 6G under

definition. Hence, AI was not our primary focus. However, AI is now

considered a core enabler for future 6G networks, which are

sometimes described as AI-native networks.

While we can see major contributions in AI-based solutions for

networking over more than two decades, only a fraction of the

community was concerned with AI at that time. Progress as a whole,

from a community perspective, was therefore limited and was

compensated for by relying on the development of AI in the

communities mentioned earlier. Even if our problems share some

commonalities, for example the volume of data to analyze, there are

many differences: data types are completely different, networks are

by nature heavily distributed, etc. Different problems are likely

to require distinct solutions. In a nutshell, network-tailored AI was

overlooked.

Second, many AI techniques require enough representative data to be

applied, regardless of whether the algorithms are supervised or

unsupervised. NM has produced many methods and technologies to

acquire data. However, in most cases, the goal was not to support AI

techniques, which leads to a mismatch. For example, (deep) learning

techniques mostly rely on having vectors of (real) numbers as input,

which fits some metrics (packet/byte counts, latency, delays, etc.)

but needs some adjustment for categorical (IP addresses, port

numbers, etc.) or topological features. Conversions are usually

applied using common techniques like one-hot encoding or by coarse-

grained representations [Sco11]. However, more advanced techniques

have been recently proposed to embed representation of network

entities rather than pure encoding [Rin17][Evr19][Sol20].
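
The conversions mentioned above can be sketched in a few lines of
Python. The sketch below turns a flow record into a vector of real
numbers by keeping numeric metrics as they are, one-hot encoding the
protocol, and mapping the destination port to a coarse-grained class
before one-hot encoding it. The record fields, the vocabularies, and
the function names are assumptions made for illustration; the port
ranges follow the IANA system/user/dynamic split.

```python
PROTOCOLS = ["tcp", "udp", "icmp"]            # assumed vocabulary
PORT_CLASSES = ["system", "user", "dynamic"]  # coarse-grained port classes


def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of value and 0.0 elsewhere."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def port_class(port):
    """Map a port number to a coarse class (IANA port ranges)."""
    if port < 1024:
        return "system"
    if port < 49152:
        return "user"
    return "dynamic"


def encode_flow(flow):
    """Encode a flow record (dict) as a flat vector of floats."""
    return (
        [float(flow["packets"]), float(flow["bytes"])]  # numeric metrics as-is
        + one_hot(flow["protocol"], PROTOCOLS)          # categorical feature
        + one_hot(port_class(flow["dst_port"]), PORT_CLASSES)
    )


# Made-up DNS-like flow record.
vector = encode_flow(
    {"packets": 10, "bytes": 1500, "protocol": "udp", "dst_port": 53}
)
```

One-hot encoding every individual port or IP address would produce
very large, sparse vectors, which is one motivation for the coarse-
grained and embedding-based representations cited above.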

An additional challenge concerns the fact that AI techniques that

involve analysis of networking data can also lead to the extraction

of sensitive and personally-identifiable information, raising

potential privacy concerns and concerns regarding the potential for

abuse. For example, AI techniques used to analyze encrypted network

traffic with the legitimate goal to protect the network from

intrusions and illegitimate attack traffic could be used to infer

information about network usage and interactions of network users.

Intelligent data analysis and the need to maintain privacy are in

many ways contradictory in nature, resulting in an arms race.

Similarly, training ML solutions on real network data is in many

cases preferable to using less realistic synthetic data sets.

However, network data may contain private or sensitive data, the

sharing of which may be problematic from a privacy standpoint and

even result in legal exposure. The challenge is thus how to

allow AI techniques to perform legitimate network management

functions and provide network owners with operational insights into

what is going on in their networks, while prohibiting their potential

for abuse for other (illegitimate) purposes.

Finally, networks are already operated thanks to (semi-)automated

procedures involving a large number of resources which are

synchronized with management or orchestration tools. Adding AI

presupposes that it can be seamlessly integrated within pre-existing

processes. Although the goal of these procedures might be solely to

provide relevant information to operators through alerts or

dashboards in case of monitoring applications, many other

applications rely on those procedures to trigger actions on the

different resources, which can be local or remote. The use of AI or

any other approach to derive NM actions adds further constraints on

them, especially regarding timing and synchronization, in order to

maintain coherence over a distributed system.

A related challenge concerns the fact that to be deployed, a solution

needs to not only provide a technical solution but to also be

acceptable to users - in this case, network administrators and

operators. One challenge with automated solutions is that

users want to feel "in control" and able to understand what is going

on, even more so if ultimately those users are the ones who are held

accountable for whether or not the network is running smoothly.

Those same concerns extend to artificially intelligent systems for

obvious reasons. To mitigate those concerns, aspects such as the

ability to explain actions that are taken - or about to be taken - by

AI systems become important.

Beyond reasons of making users more comfortable, there are

potentially also legal or regulatory ramifications to ensure that

actions taken are properly understood. For example, agencies such as

the FCC may impose fines on network operators when services such as

E911 experience outages, as there is a public interest in ensuring

highest availability for such services. In investigating causes for

such outages, the underlying behavior of systems has to be properly

understood, and even more so the reasons for actions that fall under

the realm of network operations.

6. AI techniques for network management

6.1. Problem type and mapping

In the last few years, an increasing number of different AI

techniques have been proposed and applied successfully to a growing

variety of different problems in different domains, including network

management [Mus18], [Xie18]. Some of the more recently proposed AI

approaches are clearly advancements of older approaches, which they

supersede. Many other AI approaches are not predecessors or

successors but simply complementary because they are useful for

different problems or optimize different metrics. In fact, different

AI approaches are useful for different kinds of problem inputs (e.g.,

tabular data vs. text vs. images vs. time series) and also for

different kinds of desired outputs (e.g., a predicted value, a

classification, or an action). Similarly, there may be trade-offs

between multiple approaches that take the same kind of inputs and

desired outputs (e.g., in terms of desired objective, computation

complexity, constraints).

Overall, it is a key challenge of using AI for network management to

properly understand and map which kind of problems with which inputs,

outputs, and objectives are best solved with which kind of AI (or

non-AI) approaches. Given the wealth of existing and newly released

AI approaches, this is far from a trivial task.

6.1.1. Sub-challenge: Suitable Approach for Given Input

Different problems in network management come with widely different

problem parameters. For example, security-related problems may have

large amounts of text or encrypted data as input, whereas forecasting

problems have historical time series data as input. They also vary

in the amount of available data.

Both the type and amount of data influence which AI techniques could

be useful. On one hand, in scenarios with little data, classical

machine learning techniques (e.g., SVM, tree-based approaches, etc.)

are often sufficient and even superior to neural networks. On the

other hand, neural networks have the advantage of learning complex

models from large amounts of data without requiring feature

engineering. Here, different neural network architectures are useful

for different kinds of problems. The traditional and simplest

architectures are (fully connected) multi-layer perceptrons (MLPs),

which are useful for structured, tabular data. For images, videos,

or other high-dimensional data with correlation between "close"

features, convolutional neural networks (CNNs) are useful. Recurrent

neural networks (RNNs), especially LSTMs, and attention-based neural

networks (transformers) are great for sequential data like time

series or text. Finally, Graph Neural Networks (GNNs) can

incorporate and consider the graph-structured input, which is very

useful in network management, e.g., to represent the network

topology.
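
To make the idea of graph-structured input more concrete, the
following Python sketch performs one round of neighborhood
aggregation over a small topology, which is the basic building block
that GNN layers repeat. It is deliberately simplified: actual GNN
layers also apply learned weights and a non-linearity, which are
omitted here. The topology, the per-node load feature, and the
function name are assumptions made for illustration.

```python
def aggregate(features, adjacency):
    """One round of mean aggregation over each node and its neighbors.

    features:  dict mapping node -> list of floats (node features)
    adjacency: dict mapping node -> list of neighboring nodes
    Returns a new dict with the averaged feature vector per node.
    """
    updated = {}
    for node, neighbors in adjacency.items():
        group = [node] + list(neighbors)
        dim = len(features[node])
        updated[node] = [
            sum(features[n][i] for n in group) / len(group) for i in range(dim)
        ]
    return updated


# Made-up line topology r1 - r2 - r3 with a link-load feature per router.
topology = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2"]}
load = {"r1": [0.9], "r2": [0.3], "r3": [0.0]}
h1 = aggregate(load, topology)
```

After one round, each router's representation reflects its direct
neighbors; applying the function again would propagate information
two hops away.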

The aforementioned rough guidelines can help identify a suitable AI

approach and neural network architecture. Still, best results are

often only achieved with sophisticated combinations of different

approaches. For example, multiple elements can be combined into one

architecture, e.g., with both CNNs and LSTMs, and multiple separate

AI approaches can be used as an ensemble to combine their strengths.

Here, simplifying the mapping from problem type and input to suitable

AI approaches and architectures is clearly an open challenge. Future

work should address this challenge both by providing clearer

guidelines and by striving for more general AI approaches that can

easily be applied to a large variety of different problem inputs.

6.1.2. Sub-challenge: Suitable Approach for Desired Output

Similar to the challenge of identifying suitable AI approaches for a

given problem input, the desired output for a given problem also

affects which AI approach should be chosen. Here, the format of the

desired output (single value, class, action, etc.), the frequency of

these outputs and their meaning should be considered.

Again, there are rough guidelines for identifying a group of suitable

AI approaches. For example, if a single value is required (e.g., the

amount of resources to allocate to a service instance), then typical

supervised regression approaches should be used. If classification

(e.g., of malware or another security issue [Abd10]) instead of a

value is desired, supervised classification methods should be used.

Alternatively, unsupervised machine learning can help to cluster

given data into separate groups, which can be useful to analyze

networking data, e.g., for better understanding different types of

traffic or user segments.

In addition to these classical supervised and unsupervised methods,

reinforcement learning approaches allow active, sequential decisions

rather than simple predictions or classifications. This is often

useful in network management, e.g., to actively control service

scaling and placement as well as flow scheduling and routing.

Reinforcement learning agents autonomously select suitable actions in

a given environment and are especially useful for self-learning

network management. In addition to model-free reinforcement

learning, model-based planning approaches (e.g., Monte Carlo Tree

Search (MCTS)) also allow choosing suitable actions in a given

environment but require full knowledge of the environment dynamics.

In contrast, model-free reinforcement learning is ideal for scenarios

with unknown environment dynamics, which is often the case in network

management.
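
The following Python sketch illustrates model-free reinforcement
learning with tabular Q-learning on a toy service scaling problem.
The agent observes a load level and chooses how many instances to
run, without being told the reward function or the load dynamics.
The environment, the reward values, and the function name
train_scaler are assumptions made for illustration and do not come
from the cited literature.

```python
import random


def train_scaler(steps=20000, alpha=0.1, gamma=0.5, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy scaling problem (illustrative numbers)."""
    rng = random.Random(seed)
    states = [0, 1, 2]   # observed load level: low, medium, high
    actions = [1, 2, 3]  # number of service instances to run
    q = {(s, a): 0.0 for s in states for a in actions}

    def reward(load, instances):
        needed = load + 1  # instances needed to serve the load
        penalty = 10.0 if instances < needed else 0.0
        return -float(instances) - penalty  # resource cost plus SLA penalty

    state = rng.choice(states)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)  # explore
        else:
            action = max(actions, key=lambda a: q[(state, a)])  # exploit
        r = reward(state, action)
        next_state = rng.choice(states)  # load evolves outside the agent's control
        best_next = max(q[(next_state, a)] for a in actions)
        # Standard Q-learning update towards the bootstrapped target.
        q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
        state = next_state
    # Greedy policy: best known number of instances per load level.
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}


policy = train_scaler()
```

In this toy setting the learned policy provisions exactly as many
instances as the load requires. Real scaling problems have far larger
state and action spaces, which is where function approximation (e.g.,
deep reinforcement learning) comes in.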

Similar to the previous sub-challenge, these are just rough

guidelines that can help to select a suitable group of AI approaches.

Identifying the most suitable approach within the group, e.g., the

best out of the many existing reinforcement learning approaches, is

still challenging. And, as before, different approaches could be

combined to enable even more effective network management (e.g.,

heuristics + RL, LSTMs + RL, ...). Here, further research may

simplify the mapping from desired problem output to choosing or

designing a suitable AI approach.

6.1.3. Sub-challenge: Tailoring the AI Approach to the Given Problem

After addressing the two aforementioned sub-challenges, one may have

selected a useful kind of AI approach for the given input and output

of a network management problem. For example, one may select

regression and supervised learning to forecast upcoming network

traffic, or reinforcement learning to continuously control

network and service coordination (scaling, placement, etc.).

However, even within each of these fields (regression, reinforcement

learning, etc.), there are many possible algorithms and

hyperparameters to consider. Selecting a suitable algorithm and

parametrizing it with the right hyperparameters is crucial to tailor

the AI approach to the given network management problem.

For example, there are many different regression techniques

(classical linear, polynomial regression, lasso/ridge regression,

SVR, regression trees, neural networks, etc.), each with different

benefits and drawbacks and each with its own set of hyperparameters.

Choosing a suitable technique depends on the amount and structure of

the input data as well as on the desired output. It also depends on

the available amount of compute resources and compute time until a

prediction is required. If resources and time are not a limiting

factor, many hyperparameters can be tuned automatically. In

practice, however, the design space of choosing algorithms and

hyperparameters is often so large that it cannot be effectively tuned

automatically but also requires some initial expertise in selecting

suitable AI algorithms and hyperparameters.
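
As a minimal illustration of automatic hyperparameter tuning, the
Python sketch below selects the window size of a moving-average
traffic forecaster by exhaustive search over a small candidate grid,
scoring each candidate by its one-step-ahead mean absolute error on
held-out samples. The forecaster, the candidate grid, and the traffic
samples are assumptions made for illustration.

```python
def moving_average_forecast(history, window):
    """Predict the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def validation_mae(series, window, start):
    """Mean absolute one-step-ahead error on series[start:]."""
    errors = [
        abs(moving_average_forecast(series[:t], window) - series[t])
        for t in range(start, len(series))
    ]
    return sum(errors) / len(errors)


def grid_search(series, candidates, start):
    """Evaluate every candidate window and return the best one with all scores."""
    scores = {w: validation_mae(series, w, start) for w in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Made-up traffic samples with a period of three time steps.
traffic = [10, 12, 30, 11, 13, 31, 12, 14, 32, 13, 15, 33]
best_window, scores = grid_search(traffic, candidates=[1, 2, 3, 6], start=6)
```

With a single hyperparameter and four candidates, exhaustive search
is trivial. The number of combinations grows multiplicatively with
each additional hyperparameter, which is why expertise is still
needed to restrict the search space.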

This sub-challenge holds for all fields of AI: supervised learning

(regression and classification), self-supervised learning,

unsupervised learning, and reinforcement learning are each broad and

rapidly growing fields. Selecting suitable algorithms and

hyperparameters to tailor AI approaches to the network management

problem is both an opportunity and a challenge. Here, future work

should further explore these trade-offs and provide clearer

guidelines on how to navigate these trade-offs for different network

management tasks.

6.2. Performance of produced models

From a general point of view, any AI technique will produce results

with a certain level of quality. This leads to three inherent

questions: (1) What is the definition of performance in the context

of an NM application? (2) How can it be measured? (3) How can the

quality of produced results be ensured or improved?

Many metrics have already been defined to evaluate the performance of

AI-based techniques with regard to their NM-level objectives. For

example, QoS metrics (throughput, latency) can serve to measure the

performance of a routing algorithm, along with its computational

complexity (memory consumption, size of routing tables). The

question is how to model and measure these two antagonistic types of

metrics. The numbers of true/false positives/negatives are the most

basic metrics for network attack detection functions. Although the

first two questions are thus already largely answered, even if

improvements can be made, question (3) refers to the integration of

metrics into AI algorithms. The objective is to obtain the best

results as quantified by these metrics. Depending on the type of

algorithm, these metrics are either evaluated in an online manner

with a feedback loop (for example with reinforcement learning) or in

batch to optimize a model based on a particular context (for example

described by a dataset for machine learning).
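
For the attack detection case mentioned above, the basic counts and
some commonly derived ratios can be computed as in the following
Python sketch. The function name and the sample verdicts are
assumptions made for illustration.

```python
def detection_metrics(predicted, actual):
    """Compute detection metrics from two equal-length lists of booleans.

    True means "attack". Ratios default to 0.0 when their denominator is 0.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)          # attacks correctly flagged
    fp = sum(p and not a for p, a in pairs)      # benign flows flagged
    fn = sum(not p and a for p, a in pairs)      # attacks missed
    tn = sum(not p and not a for p, a in pairs)  # benign flows passed
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Made-up verdicts for six flows.
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]
metrics = detection_metrics(predicted, actual)
```

Which of these ratios matters most depends on the operational
context, for example whether missed attacks or false alarms are more
costly for the operator.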

The problem is two-fold. First, the performance can be measured

through multiple metrics of different types (numerical or ordinal for

example) and some can be constrained by fixed boundaries (like a

maximum latency), making their joint use challenging when creating an

AI model to resolve an NM problem. Second, the metrics differ from

each other in scale and in importance or impact, and can vary across

domains. It can be hard to precisely assess what is a good or bad

value (as it might depend on multiple other ones), and it is even

more difficult to integrate this into an AI technique, especially

for learning algorithms that adjust their models based on the

performance. Indeed, learning algorithms run through multiple

iterations and rely on internal metrics (MAE or (R)MSE for neural

networks, Gini index or entropy for decision trees, distance to a

hyperplane for SVMs, etc.) which are not necessarily strongly

correlated with the final metrics of the application. For instance,

a decision tree algorithm for classification purposes aims to create

branches with a maximum of data from the same class and so avoid

mixing classes. This is done thanks to a criterion like the entropy

index, but this kind of index does not distinguish between mixing

classes A and B and mixing classes A and C. Assume now that, from an

operational point of view, mixing A and B in the predictions is not

critical. The algorithm should then prefer to mix A and B rather

than A and C, even if in the first case it produces more errors.

Therefore, the internal functioning of the AI algorithms should be

refined, here by defining a particular criterion to replace the

entropy as a quality measure when separating two branches. It

assumes that the final NM objectives are integrated at this stage.
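
One possible form of such a criterion is sketched below in Python:
the impurity of a tree node is measured as the lowest expected
misclassification cost under an operator-supplied cost matrix,
instead of entropy. This is only one way to integrate NM objectives
and is not the method of a specific cited work; the cost values, the
class counts, and the function names are assumptions made for
illustration.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of the class distribution in a node."""
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log2(n / total) for n in counts.values() if n > 0
    )


def expected_cost(counts, cost):
    """Lowest average cost achievable by predicting one class for the node.

    cost maps (true_class, predicted_class) -> cost of that confusion.
    Unlisted confusions cost 1.0 and correct predictions cost 0.0.
    """
    total = sum(counts.values())

    def cost_if_predicting(p):
        return sum(
            n * cost.get((true, p), 0.0 if true == p else 1.0)
            for true, n in counts.items()
        ) / total

    return min(cost_if_predicting(p) for p in counts)


# Assumed operational costs: confusing A and B is nearly harmless.
cost = {("A", "B"): 0.1, ("B", "A"): 0.1}
node_ab = {"A": 50, "B": 50}  # mixes A and B, many errors
node_ac = {"A": 80, "C": 20}  # mixes A and C, fewer errors
```

With these numbers, entropy rates node_ab as worse than node_ac
(1.0 bit versus about 0.72 bits), whereas the cost-based criterion
rates it as better (0.05 versus 0.2), which matches the operational
preference described above.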

Another concrete example is traffic predictors which aim at

forecasting traffic demands. They only produce an input that is not

necessarily simple to interpret and use in, e.g., capacity

allocation strategies/policies. A traditional traffic prediction

that tries to minimize (perfectly symmetric) MAE/MSE treats positive

and negative errors in identical ways, hence is agnostic of the

diverse meaning (and costs) of under- and over-provisioning. And,

such a prediction does not provide any information on, e.g., how to

dimension resources/capacity to accommodate the future demand

avoiding all underprovisioning (which entails service disruption)

while minimizing overprovisioning (i.e., wasting resources). In

other words, it forces the operator to guess the overprovisioning by

taking (non-informed) safety margins. A more sensible approach here

is instead forecasting directly the needed capacity, rather than the

traffic [Beg19].

While the one above is just an example, the high-level challenge is

devising forecasting models that minimize the correct objective/loss

function for the specific NM task at hand (instead of generic MAE/

MSE). In this way, the prediction phase becomes an integral part of

the NM, and not just a (limited and hard-to-use) input to it. In ML

terms, this maps to solving the loss-metric mismatch in the context

of anticipatory NM [Hua19].
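
The effect of the loss function can be illustrated with a small
Python sketch that compares a symmetric absolute error with an
asymmetric loss in which under-provisioning costs more than over-
provisioning. The cost ratio of 10 to 1, the demand samples, and the
function names are assumptions made for illustration and are not
taken from [Beg19] or [Hua19].

```python
def mae_loss(capacity, demand):
    """Symmetric loss: over- and under-provisioning are treated alike."""
    return abs(capacity - demand)


def capacity_loss(capacity, demand, under_cost=10.0, over_cost=1.0):
    """Asymmetric loss: under-provisioning is penalized more heavily."""
    if capacity < demand:
        return under_cost * (demand - capacity)
    return over_cost * (capacity - demand)


def best_constant(candidates, demands, loss):
    """Candidate capacity with the lowest total loss over the demand samples."""
    return min(candidates, key=lambda c: sum(loss(c, d) for d in demands))


# Made-up demand samples (e.g., Mbit/s).
demands = [80, 90, 100, 110, 120]
symmetric_choice = best_constant(demands, demands, mae_loss)
asymmetric_choice = best_constant(demands, demands, capacity_loss)
```

Minimizing the symmetric loss selects the median demand (100), which
leaves two of the five samples under-provisioned. Minimizing the
asymmetric loss selects 120, which covers all samples. The safety
margin thus follows from the loss function rather than from a guess
by the operator.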

Another issue for statistical learning (from examples/observations)

is mainly about extracting an estimator from a finite set of input-

output samples drawn from an unknown probability distribution that

should be descriptive enough for unseen/new input data. In this

context online monitoring and error control of the quality/properties

of these point estimators (bias, variance, mean squared error, etc.)

is critical for dynamic/uncertain network environments. Similar

reasoning/challenge applies for interval estimates, i.e., confidence

intervals (frequentist) and credible intervals (Bayesian).

6.3. Lightweight AI

Network management and operations often need to be performed under

strict time constraints, i.e. at line rate, in particular in the

context of autonomic or self-driven networks. Locating NM functions

as close as possible to where forwarding takes place is thus an

interesting option to avoid the additional delays incurred when these

operations are performed remotely, for example in a centralized

controller. Besides, forwarding devices may offer available resources

to supplement or replace edge resources. When AI is coupled with

network management, AI tasks can be offloaded to network devices, or

more generally embedded within the network. Obviously, time-critical

tasks are the best candidates to be offloaded within the network.

Costly learning tasks should be processed in high-end servers but

created models can be deployed, configured, modified and tuned in

switches.

Recent advances in network programmability ease the programming of

specific tasks at data-plane level. P4 [Bos14] is widely used today

for many tasks including firewalling [Dat18] or bandwidth management

[Che19]. P4 is designed to be agnostic to any specific hardware.

Switches actually have particular architectures and the RMT

(Reconfigurable Match Table) [Bos13] model is generally accepted to

be generic enough to represent limited but essential switch

architecture components and functionalities. P4 is inspired by this

architecture. The RMT model allows reconfiguring match-action tables

where actions can be usual ones (rewrite some headers, forward,

drop...). Actions are thus applied on the packets when they are

forwarded. Actions can also be more complex programs with some

safeguards: no loop, resistivity... The impact on program

development is huge. For example, real-number operations are not

available by default although they are essential in many AI algorithms.
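
A common workaround on integer-only targets is fixed-point
arithmetic, where real numbers are scaled to integers and products
are rescaled with a bit shift. The sketch below shows the idea in
Python for readability; it is not P4 code, and the choice of 8
fractional bits as well as the function names are assumptions made
for illustration.

```python
FRAC_BITS = 8           # assumed number of fractional bits
SCALE = 1 << FRAC_BITS  # 256


def to_fixed(x):
    """Convert a real number to its fixed-point integer representation."""
    return int(round(x * SCALE))


def from_fixed(q):
    """Convert a fixed-point integer back to a float (for inspection only)."""
    return q / SCALE


def fixed_mul(a, b):
    """Multiply two fixed-point numbers using integer operations only."""
    return (a * b) >> FRAC_BITS


def fixed_dot(weights, inputs):
    """Dot product of two fixed-point vectors, e.g., one neuron's weighted sum."""
    return sum(fixed_mul(w, x) for w, x in zip(weights, inputs))


# Weighted sum 0.5 * 2.0 + 0.25 * 3.0 computed with integers only.
weights = [to_fixed(0.5), to_fixed(0.25)]
inputs = [to_fixed(2.0), to_fixed(3.0)]
result = from_fixed(fixed_dot(weights, inputs))
```

The result here is exact (1.75) because all values are representable
with 8 fractional bits. In general, fixed-point arithmetic trades
precision and range for the ability to run on hardware without
floating-point support.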

In a nutshell, the first challenge to overcome when embedding AI in a

network is the capacity of the hardware to support AI operations

(architectural limitation). Considering software equipment such as a

virtual switch simplifies the problem but does not totally resolve it

as, even in that case, strong line-rate requirements limit the type

of programs that can be executed. For example, BPF (Berkeley Packet

Filter) programs provide greater control over packet processing in

OVS [Cha18] but still have some limitations, as the execution time of

these programs is bounded by nature to ensure their termination, an

essential requirement assuming the run-to-completion model which

permits high throughput.

The second challenge (resource limitation) of network-embedded AI is

to allocate enough resources for AI tasks with a limited impact on

other tasks of network devices such as forwarding, monitoring,

filtering... Approximation and/or optimization of AI tasks are

potential directions to help in this area. For instance, many

network monitoring proposals rely on sketches, with well-tuned

implementations proposed for the data plane [Liu16][Yan18].

However, no general optimized AI-programmable abstraction exists to

fit all cases and proposals are mostly use-case centric. Research

in NM on this issue can benefit from proposals in the field of

embedded systems, which faces the same issues.

Binarization of neural networks is one example [Lia18]. Besides,

distributed processing is a common technique to distribute the load

of a single task between multiple entities. AI task decomposition

between network elements, edge servers or controllers has been also

proposed [Gup18].
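
As an example of the approximation techniques mentioned in this
section, the following Python sketch implements a count-min sketch,
a well-known structure that estimates per-flow counts in fixed
memory at the price of possible over-estimation. It is a generic
textbook version rather than the specific designs of [Liu16] or
[Yan18]. The table dimensions and the use of SHA-256 as hash function
are assumptions made for convenience; data-plane implementations
would use much cheaper hash functions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter: estimates never undercount."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(
            self.rows[row][self._index(key, row)] for row in range(self.depth)
        )


# Made-up per-flow packet counts.
true_counts = {"10.0.0.1->10.0.0.2": 120, "10.0.0.3->10.0.0.4": 7, "10.0.0.5->10.0.0.6": 42}
sketch = CountMinSketch()
for flow, packets in true_counts.items():
    sketch.add(flow, packets)
```

Memory use is width times depth counters regardless of the number of
flows, which is what makes this kind of structure attractive for
resource-constrained network devices.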

6.4. AI for planning of actions

Many tasks in network management revolve around the planning of

actions with the purpose of optimizing a network and facilitating the

delivery of communication services. For example, paths need to be

planned and set up in ways that minimize wasted network resources (to

optimize cost) while facilitating high network utilization (avoiding

bottlenecks and the formation of congestion hotspots) and ensuring

resiliency (by making sure that backup paths are not congruent with

primary paths). Other examples were mentioned in section 2.
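
The resiliency constraint above (a backup path that is not congruent
with the primary path) can be sketched as follows. This is a naive
two-step heuristic, assuming an undirected topology and hop count as
the only cost: it is not guaranteed to find a disjoint pair whenever
one exists, and the topology and function names are hypothetical.

```python
from collections import deque


def shortest_path(links, src, dst):
    """Breadth-first search over undirected links; returns a node list or None."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


def primary_and_backup(links, src, dst):
    """Plan a primary path, then a backup that shares no link with it."""
    primary = shortest_path(links, src, dst)
    if primary is None:
        return None, None
    used = {frozenset(hop) for hop in zip(primary, primary[1:])}
    remaining = [link for link in links if frozenset(link) not in used]
    return primary, shortest_path(remaining, src, dst)


# Hypothetical five-node topology.
LINKS = [("A", "B"), ("B", "E"), ("A", "C"), ("C", "D"), ("D", "E")]
primary, backup = primary_and_backup(LINKS, "A", "E")
```

Even this small example hints at why planning is hard: adding
capacity, cost and utilization objectives on top of disjointness
quickly turns the problem into a combinatorial one.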

The need for planning only increases with the rise of centralized

control planes. The promise of central control is that decisions can

be optimized when made with complete knowledge of relevant context,

as opposed to distributed control that needs to rely on local

decisions being made with incomplete knowledge while incurring higher

overhead to replicate relevant state across multiple systems.

However, as the scale of networks and interconnected systems

continues to grow, so does the size of the planning task. Many

problems are NP-hard. As a result, solutions typically need to rely

on heuristics and algorithms that often result in suboptimal outcomes

and that are challenging to deploy in a scalable manner.

The emergence of Intent-Based Networking emphasizes the need for

automated planning even further. The concept underlying "intent" is

that it should allow users (network operators, not end users of

communication services) to articulate desired outcomes without the

need to specify how to achieve those outcomes. An Intent-Based

System is responsible for translating the intent into courses of

action that achieve the desired outcomes and that continue to

maintain the outcomes over time. How the necessary courses of action

are derived and what planning needs to take place is left open, but

this is where the real challenge lies. Solutions that rely on clever

algorithms devised by human developers face the same challenges as

any other network management tasks.

These properties (problems with a clearly defined need, whose

solution is faced with exploding search spaces and that today rely on

algorithms and heuristics that in many cases result only in suboptimal

outcomes and significant limitations in scale) make automated

planning of actions an ideal candidate for the application of AI-

based solutions.

AI applications in network management in the past have been largely

focusing on classification problems. Examples include analysis by

Intrusion Protection Systems of traffic flow patterns to detect

suspicious traffic, classification of encrypted traffic for improved

QoS treatment based on suspected application type, and prediction of

performance parameters based on observations. In addition, AI has

been used for troubleshooting and diagnostics, as well as for

automated help and customer support systems. However, AI-based

solutions for the automated planning of actions, including the

automated identification of courses of action, have to this point not

been explored much.

A much-publicized leap in AI has been the development of AlphaGo.

Instead of using AI to merely solve classification problems, AlphaGo

has been successful in automatically deriving winning strategies for

board games, specifically the game of Go, which features a

prohibitively large search space that was long thought to put the

ability to play Go at a world-class level beyond the reach of AI.

Among the remarkable aspects of AlphaGo is that it is able to

identify winning strategies completely on its own, without needing

those strategies to be taught or learned by observation, provided

the system is aware of the rules.

The challenge for AI in network management is hence: where is the

equivalent of an AlphaGo that can be applied to network management

(and networking) problems? Specifically, better approaches are needed

that automatically derive plans and courses of action

for network optimization and similar NP-hard problems, such as

provided today with only limited effectiveness by controllers and

management applications.

Also, the evaluation of AI algorithms to derive courses of action is

more complex than more common regression or classification tasks.

Actions need to be applied in order to observe the results they lead

to. However, contrary to game playing, solutions need to be applied

in the real world, where actions have real effects and consequences.

Different approaches can be envisioned. First, incremental

application of AI decisions with small steps can allow us to

carefully observe and detect unexpected effects. This can be

complemented with roll-back techniques. Second, formal verification

techniques can be leveraged to verify that decisions made by AI

remain within safety bounds. Third, sandbox environments can be

used but they SHOULD be representative enough of the real world.

After progress in simulation and emulation, recent research advances

have led to the definition of digital twins, which imply a tight

coupling between a real system and its digital twin to ensure a

parallel but synchronized execution. Alternatively, transfer

learning is another promising area, making it possible to

apply to a real-world system ML models learned in a more

generic sandbox environment. It is also an open problem to

make the use of AI more acceptable as highlighted in the dedicated

section.
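
The first direction listed above (small incremental steps combined
with roll-back) can be sketched as follows. This is a minimal
illustration, assuming a single integer-valued parameter such as a
link weight; the function name and the health check are hypothetical
and stand in for whatever monitoring an operator actually relies on.

```python
def apply_incrementally(current, target, step, is_healthy):
    """Move `current` towards `target` in steps of at most `step`.

    `is_healthy` is a caller-supplied check evaluated for every new
    value. On the first failure, the last known-good value is kept and
    returned together with the history of accepted values.
    """
    history = [current]
    while current != target:
        # Clamp the move so that no single step exceeds `step`.
        delta = max(-step, min(step, target - current))
        candidate = current + delta
        if not is_healthy(candidate):
            return history[-1], history  # roll back to the last good value
        current = candidate
        history.append(current)
    return current, history


# An AI-proposed change from weight 10 to 40, where (hypothetically)
# the network only stays healthy up to a weight of 30.
final_value, accepted = apply_incrementally(10, 40, 10, lambda w: w <= 30)
```

In this run the change stops at 30 rather than reaching 40. Integer
values are used so that the loop terminates exactly at the target.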

7. Network data as input for ML algorithms

Many applications of AI take data as input. The quality of the

outputs of ML-based techniques is highly dependent on the quality

and quantity of data used for learning but also on other parameters.

For example, as modern network infrastructures move towards higher

speed and scale, they aim to support increasingly more demanding

services with strict performance guarantees. These often require

resource reconfigurations at run time, in response to emerging

network events, so that they can ensure reliable delivery at the

expected performance level. Timely observation and detection of

events is also of paramount importance for security purposes, and can

allow faster execution of remedy actions thus leading to reduced

service downtime.

Thus, the challenge of data management is multifaceted, as detailed in

the following subsections.

7.1. Data for AI-based NM solutions

Assuming a network management application, the first problem to

address is to define the data to be collected which will be

appropriate to obtain accurate results. This data selection can

require defining problem-specific data or features (feature

engineering).

Firstly, NM has already produced a lot of methods and technologies to

acquire data. However, in most cases, the goal was not to support AI

problems, which leads to a mismatch. Indeed, machine learning

algorithms only work as desired when the data to be analyzed respects

certain properties. Many methods rely on vector-based distances,

which presupposes that the data encoded into the vector respects the

underlying distance semantics. Taking the first n bytes of a packet

as vectors and computing distances accordingly is possible but does

not capture the semantics of the information carried in the headers.

For example,

(deep) learning techniques mostly rely on vectors of (real) numbers

as input which fits some metrics (packet/byte counts, latency,

delays, etc) but needs some adjustment for categorical (IP addresses,

port numbers, etc) or topological features. Conversions are usually

applied using common techniques like one-hot encoding or by coarse-

grained representations [Sco11]. However, more advanced techniques

have been recently proposed to embed representation of network

entities rather than pure encoding [Rin17][Evr19][Sol20]. Data to

handle can be in a schema-free or possibly text-based format. One

example could be the automated annotation of management intents

provided in an unstructured textual format (policy descriptions,

specifications, etc.) to extract from them management entities and

operations. For that purpose, suitable annotation models need to be

built using existing NER (Named Entity Recognition) techniques

usually applied for NLP. However, these models SHALL be carefully

crafted or specialized for network management (intent) language, which

indirectly leads back to the challenges of AI techniques for NM specified

earlier.
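
To make the conversion of categorical features concrete, the
following sketch builds a numeric vector from a flow record. It
combines one-hot encoding for the protocol with a coarse-grained
representation of the destination port based on the IANA port ranges.
The record layout, the field names and the chosen categories are
assumptions made for this example; they are not taken from [Sco11]
or from the embedding techniques cited above.

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1.0 if value == category else 0.0 for category in categories]


def encode_flow(flow, protocols, port_classes):
    """Build a numeric feature vector from a flow record (a plain dict here)."""
    port = flow["dst_port"]
    # Coarse-grained port representation instead of 65536 one-hot positions.
    if port < 1024:
        port_class = "well-known"
    elif port < 49152:
        port_class = "registered"
    else:
        port_class = "dynamic"
    return (
        [float(flow["packets"]), float(flow["bytes"])]
        + one_hot(flow["protocol"], protocols)
        + one_hot(port_class, port_classes)
    )


PROTOCOLS = ["tcp", "udp", "icmp"]
PORT_CLASSES = ["well-known", "registered", "dynamic"]
vector = encode_flow(
    {"packets": 12, "bytes": 3400, "protocol": "udp", "dst_port": 53},
    PROTOCOLS,
    PORT_CLASSES,
)
```

With this encoding, two flows that differ only in protocol are at
the same distance from each other whatever the protocol numbers are,
which a raw numeric encoding of the protocol field would not provide.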

Secondly, the behavior of any network is not just derived from the

events that can be directly observed, such as network traffic

overload, but also from events occurring outside the environment of

the network. The information provided by the detectors of such kinds

of events, e.g. a natural incident (earthquake, storm), can be used

to determine the adaptation of the network to avoid potential

problems derived from such events. Those can be provided by BigData

sources as well as sensors of many kinds. The AI challenge related

to this task is to process large amounts of data and associate it

with the effects that those events have on the network. It is hard

to determine the static and dynamic relation between the data

provided by external sources and the specific implications it has in

networks. For instance, the effect of a "flash crowd" detected in an

external source depends on the relation of a particular network to

such an event. This can be addressed by AI and its particular

application to network management. The objective is to complement a

control-loop, as shown in [Mar18], by including the specific AI

engines into the decision components as well as the processes that

close the loop, so the AI engine can receive feedback from the

network in order to improve its own behavior. Similar challenges are

addressed in other domains, such as image processing and computer vision, by

using artifacts for anticipating movements in object location and

identification.

7.2. Data collection

Once the data to collect is defined, the second problem to address is

its collection. Monitoring frameworks such as IPFIX [RFC7011] have

been developed for many years, joined more recently by SDN-based

monitoring solutions [Yu14][Ngu20]. However, moving towards more AI

for actions in network management also requires retrieving more than

traffic-related information. In fact, configuration information such

as topologies, routing tables or security policies has been proven to

be relevant in specific scenarios. As a result, many different

technologies can be used to retrieve meaningful data. To support

improved QoE, monitoring of the application layer is helpful but far

from being easy with the heterogeneity of end-user applications and

the wide use of encrypted channels. Monitoring techniques need to be

reinvented through the definition of new techniques to extract

knowledge from raw measurement [Bri19] or by involving end-users with

crowd-sourcing [Hir15] and distributed monitoring.

The collecting process requirements depend on the kind of processing.

We can distinguish two major classes: batch/offline vs real-time/

online processing. In particular, real-time monitoring tools are key

in enabling dynamic resource management functions to operate on short

reconfiguration cycles. However, maintaining an accurate view of the

network state requires a vast amount of information to be collected

and processed. While efficient mechanisms that extract raw

measurement data at line rate have been recently developed, the

processing of collected data is still a costly operation. This

involves evaluating and aggregating a vast amount of state

information as a response to a diverse set of monitoring queries,

before generating accurate reports. Machine learning methods, e.g.

based on regression, can be used to intelligently filter the raw

measurements and thus reduce the volume of data to process. For

example, in [Tan20] the authors proposed an approach in which the

classifiers derived for this purpose (according to measurements on

traffic properties) can achieve a threefold improvement in the query

processing capability. A residual question is the storage of raw

measurements. In fact, predicting the lifetime of data is

challenging because their analysis may not be planned and triggered

by a particular event (for example, an anomaly or attack). As a

result, the provisioning of storage capacity can be hard.

In parallel to the continuously increasing dynamicity of networks and

complexity of traffic, there is a trend towards more user traffic

processing customization [RFC8986][Li19]. As a result, fine-grained

information about network element states is expected and new

proposals have emerged to collect on-path data or in-band network

telemetry information [Tan20b]. These new approaches have been

designed with much flexibility and customization and could

be helpful in conjunction with AI applications. However,

the seamless coupling of telemetry processes with packet forwarding

requires careful definition of solutions to limit the overhead and

the impact on throughput while providing the necessary level of

detail. This shares commonalities with the lightweight AI

challenge.

7.3. Usable data

Although all agree on the need for more shared datasets, sharing

is quite uncommon in practice. Data often contains private or sensitive

information and may not be shared because of the criticality of data

(which can be used by ill-intentioned adversaries) or due to laws or

regulations, even within the same company. To solve this issue,

anonymization techniques [Dij19] can be enhanced to optimize the

trade-off between the value of the data and the (potential) leakage

or reconstruction of sensitive information. Whoever the final user

of the data is, regulations and laws impose rules on data management

with a potentially costly impact if they are not respected, whether

intentionally or not. Defining a new monitoring framework should

always consider security and privacy aspects, for example to let any

user or customer access or remove their own data, as required by the

General Data Protection Regulation (GDPR) in the EU. The challenge

resides here in the capacity of qualifying what is

critical or private information and the capacity for an adversary to

reconstruct it from other sources of data. Hence AI/ML based

solutions will require more data but also more administrative, legal

and ethical procedures. Those can take a long time and so slow down the

deployment of a new solution. In addition, this requires interaction

with experts from different domains (e.g., an AI engineer and a lawyer).

The integration of these non-technical constraints should be

considered when defining new data to be collected or a new technique

to collect data. However, knowing the final use of data is most of

the time necessary for ethical and legal assessment, which implies

that those considerations SHOULD be integrated from the early design

of new AI-based solutions.
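
The trade-off between data value and leakage can be illustrated with
a simple pseudonymization of IPv4 addresses. The sketch below keeps
the leading bits of an address, so that some topological information
survives, and replaces the remaining bits with a keyed hash. It is an
illustration only: it is not one of the techniques surveyed in
[Dij19], it has not been assessed against reconstruction attacks, and
the function name, the default of 16 preserved bits and the use of
HMAC-SHA256 are assumptions made for this example.

```python
import hashlib
import hmac
import ipaddress


def pseudonymize_ipv4(address, secret, keep_bits=16):
    """Keep the top `keep_bits` of an IPv4 address and replace the rest
    with bits taken from a keyed hash of the full address."""
    value = int(ipaddress.IPv4Address(address))
    host_bits = 32 - keep_bits
    prefix = (value >> host_bits) << host_bits
    digest = hmac.new(secret, value.to_bytes(4, "big"), hashlib.sha256).digest()
    # Only the low `host_bits` bits of the digest replace the original bits.
    masked = int.from_bytes(digest[:4], "big") & ((1 << host_bits) - 1)
    return str(ipaddress.IPv4Address(prefix | masked))


pseudonym = pseudonymize_ipv4("192.0.2.17", b"example-secret")
```

A larger value of keep_bits preserves more structure for learning
but reveals more about the original address, which is the trade-off
discussed above. Distinct addresses may also map to the same
pseudonym, so the mapping is not guaranteed to be reversible or
collision-free.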

For supervised or semi-supervised training, having a labeled dataset

is a prerequisite. It constitutes a major challenge as well. On

the one hand, collectors are able to retrieve data. On the other hand,

those network data are typically unlabeled. This limits application

of ML to unsupervised learning tasks (learning from data). Because

manual labeling is a tedious task, one option is to leverage AI to

guide humans. This may also support a better generalization of a

learned model. Indeed, an underlying challenge is the genericity or

coverage of the datasets. Labels encode values of an objective

function, and the challenge posed by the design of such tools is

tremendous since it involves an M:N relationship: 1 data type may be

associated with M objective function values and N data types may be

associated with 1 objective function. As a result, most datasets used

for research encode a single label for a particular application, such

as an attack label for datasets to be used in the context of intrusion

detection or an application type for network traffic used for

classification, whereas the value of a single dataset could be

capitalized on in several applications.

Again, researchers need empirical (or at least realistic) datasets to

validate their solutions. Unfortunately, as highlighted above,

obtaining such data from real deployments is tough for various reasons

(business secrets, privacy concerns, concerns that vulnerabilities are

revealed by accident, raw unlabeled data, etc.). Even if such a

dataset is available it might not be enough to convincingly validate

a new algorithm. Instead of falling back to artificial testbed

experiments or simulation, it would be useful to have the capability

to generate datasets with characteristics that are not 100% identical

but similar to the characteristics of one or more real datasets.

Such synthetic datasets can be used to validate new management

algorithms, intrusion detection systems, etc. The usage of AI (for

example GANs) in this area [Hui22] is not yet widespread and there

are still many concerns that deter researchers, e.g. the fear of

leaking sensitive information from the original dataset into the

synthetic dataset.

8. Acceptability of AI

Networks are critical infrastructures. On one hand, they SHOULD be

operated without interruption and must be interoperable. Networks,

except in a lab, are not isolated, which slows down innovation in

general. For example, changing Internet routing protocols SHOULD be

accepted by all. The same applies to other protocols. Even if there have

been several versions of major protocols in use like TCP or DNS,

there are still some security issues which cannot be patched with

100% guarantee. On the other hand, results provided by AI solutions

are uncertain by nature. The same technique applied in different

environments can produce different results. AI techniques need some

effort (time and human) to be properly configured or to be

stabilized. For instance, reinforcement learning needs several

iterations before being able to produce acceptable results. These

properties of AI techniques are thus somewhat at odds with the

criticality of network infrastructures. With that in mind,

acceptability of AI by network operators is clearly an obstacle for

its larger adoption.

8.1. Explainability of Network-AI products

A common issue across all Machine Learning (ML) applications is that

they are black boxes. This means that, after training, the knowledge

acquired by ML models is unintelligible to humans. As a result,

offering hard guarantees on performance is a very challenging issue.

In addition, complex ML models like neural networks, which often have

more than hundreds of thousands of parameters, are very hard to debug

or troubleshoot in case of failure.

While this is a common issue for all applications of AI, many areas

work well with uncertainty and the black-box behavior of AI-based

solutions. For instance, users accept an inherent error in

recommender systems or computer vision solutions.

The networking field has already produced a set of well-established

network management algorithms and methods, with clear performance

guarantees and troubleshooting mechanisms [Rex06][Kr14]. As such,

improving debugging, troubleshooting and guarantees on AI-based

solutions for networking is a must.

AI researchers and practitioners are devoting large research efforts

to improve this aspect of ML models, which is commonly known as

explainability [XAI].

This set of techniques provides insights and, in some cases,

guarantees on the performance and behavior of ML-based solutions.

Understanding such techniques, researching and applying them to

network AI is critical for the success of the field.

There exist several ML-based methods that are human-understandable,

although not widely used today. For instance, [Mar20] shows a method

for building anticipation models (prediction) that provide

explanations while determining some actions for tuning some

parameters of the network. There are other challenges that SHOULD be

addressed, such as providing explanations for other ML methods that

are widely used. For instance, xNN/SVM models can be accompanied

by Digital Twins of the network that are reversely explored to

explain some output from the ML model (e.g., xNN/SVM). In this

context, there already exist several methods [Zil20][Puj21] that

produce human-readable interpretations of trained NN models, by

analyzing their neural activations on different inputs.

8.2. AI-based products and algorithms in production systems

AI-based network management and optimization algorithms are first

trained, then the resulting model is used to produce relevant

inferences in operation, either in management or optimization

scenarios. A relevant question for the success of AI-based solutions

is: where does this training occur?

Traditionally, AI-based models have been trained in the same scenario

where they operate [Val17][Xu18], that is, the customer network.

However, this presents critical drawbacks. First, training an AI

model for management and operation typically requires generating

network configurations and scenarios that can break the network.

This is because training requires seeing a broad spectrum of

scenarios. Thus, it is not feasible in production networks. Second,

customer networks may not be equipped with the monitoring

infrastructure required to collect the data used in the training

process (e.g., performance metrics).

A more sensible approach is to train the AI-based product in a lab,

for instance in the vendor’s premises. In the lab, AI models can be

trained in a controlled testbed, with any configuration, even ones

that break the network. However, the main challenge here arises from

the fundamental differences between the lab’s network and the

customer networks. For instance, the topology of the lab’s network

might be smaller, etc. As a result, there is a need for models that

are able to generalize. In this context, generalization means that

models should be able to operate in other scenarios not seen during

training, with different topologies, routing configurations,

scheduling policies, etc.

In order to address this generalization problem, two main approaches

are possible: The first one is Transfer Learning [tl1]. With this

technique, the knowledge gained in the lab’s training is used to

operate in the customer network. Transfer Learning still requires

that some data from the customer is used to re-train the model (e.g.,

accurate performance measurements). This means that, for each

customer network, re-training is required. This presents important

drawbacks, since this represents an added cost and access to customer

data might be problematic.

A different approach is to use Graph Neural Networks (GNN)

[gnn1][gnn2]. GNNs are a novel type of neural network able to

operate and generalize over graphs. Indeed, networks are

fundamentally represented as graphs: topology, routing, etc. With

GNN, vendors can train the AI model in a lab and then use the

resulting model, as is, in different customer networks, without

additional re-training using customer data.
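
The reason a model of this kind can be moved between topologies is
that its parameters are shared by all nodes rather than tied to a
particular graph. The following sketch shows a single message-passing
round with mean aggregation on scalar node features. It is a strong
simplification of the GNN architectures in [gnn1][gnn2]: the two
weights are fixed by hand rather than trained, and all names are
hypothetical.

```python
def message_passing_round(features, edges, self_weight, neighbor_weight):
    """One round: each node combines its own feature with the mean of
    its neighbors' features, using the same two weights for every node."""
    neighbors = {node: [] for node in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = {}
    for node, value in features.items():
        incoming = [features[n] for n in neighbors[node]]
        mean = sum(incoming) / len(incoming) if incoming else 0.0
        # ReLU non-linearity applied to the weighted combination.
        updated[node] = max(0.0, self_weight * value + neighbor_weight * mean)
    return updated


# The same two weights are applied to a 3-node "lab" topology ...
lab = message_passing_round(
    {"a": 1.0, "b": 2.0, "c": 3.0},
    [("a", "b"), ("b", "c"), ("a", "c")],
    0.5,
    0.5,
)
# ... and, unchanged, to a 4-node "customer" topology.
customer = message_passing_round(
    {"w": 4.0, "x": 0.0, "y": 2.0, "z": 6.0},
    [("w", "x"), ("x", "y"), ("y", "z")],
    0.5,
    0.5,
)
```

Nothing in the function depends on the number of nodes or links,
which is what allows the same parameters to be reused on a network
that was never seen during training.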

8.3. AI with humans in the loop

Depending on the network management task, AI can automate and replace

manual human control or it can complement human experts and keep them

in the loop. Keeping humans in the loop will be an important step of

building trust in AI approaches and help ensure the desired outcomes.

There are various ways of keeping humans in the loop in the different

fields of AI, which could be useful for different aspects of network

management.

In classification tasks (e.g., detecting security breaches, malware

or detecting anomalies), trained AI models provide a confidence score

in addition to the predicted class. If the confidence is high, the

prediction is used directly. If the confidence is too low, a human

expert may jump in and make the decision - thereby also providing

valuable training data to improve the AI model. Such approaches are

already being used in industry, e.g., to automatically label datasets

(AWS SageMaker). Similar approaches could also be used for other

supervised learning tasks, e.g., regression. Still, it is an open

challenge to keep humans in the loop in all phases of the learning

process.
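
The confidence-based hand-over described above can be sketched as
follows, assuming the classifier returns a label and a confidence
score between 0 and 1 for each item. The threshold of 0.9 and all
function names are hypothetical choices for this example.

```python
def triage(predictions, threshold=0.9):
    """Split (item, label, confidence) tuples into automatic and manual queues."""
    automatic, needs_review = [], []
    for item, label, confidence in predictions:
        if confidence >= threshold:
            automatic.append((item, label))
        else:
            needs_review.append((item, label, confidence))
    return automatic, needs_review


def record_human_decision(training_set, item, human_label):
    """Store the expert's decision so it can be used in later re-training."""
    training_set.append((item, human_label))
    return training_set


predictions = [
    ("flow-1", "benign", 0.99),
    ("flow-2", "malware", 0.55),
    ("flow-3", "anomaly", 0.93),
]
automatic, needs_review = triage(predictions)

# The expert overrides the uncertain prediction; the decision becomes
# a new labeled example.
training_set = record_human_decision([], "flow-2", "benign")
```

Choosing the threshold is itself a trade-off between the workload
placed on human experts and the risk of acting on a wrong prediction.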

Another field of AI is reinforcement learning, which is useful for

taking continuous control decisions in network management, e.g.,

controlling service scaling and placement as well as flow scheduling

and routing over time. Reinforcement learning agents typically

interact with the environment (i.e., the simulated or real network)

completely autonomously without human feedback. However, there is a

growing number of approaches to put human experts back into the loop.

One approach is offline reinforcement learning, where the training

data does not come from the reinforcement learning agent’s own

exploration but from pre-recorded traces of human experts (e.g.,

placement decisions that were made by humans before). Another

approach is to reward the reinforcement learning agent based on human

feedback rather than a pre-defined reward function [Lee21]. Again,

while there are first promising approaches, more work is required in

this area. Overall, it is an open challenge to leverage the

benefits of AI while keeping human experts in the loop where it is useful.

TODO Security

This document has no IANA actions.

11. References

11.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate

Requirement Levels", BCP 14, RFC 2119,

DOI 10.17487/RFC2119, March 1997,

[RFC7011] Claise, B., Trammell, B., and P. Aitken, "Specification of

the IP Flow Information Export (IPFIX) Protocol for the

Exchange of Flow Information", STD 77, RFC 7011,

[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC

2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,

May 2017, <https://www.rfc-editor.org/info/rfc8174>.

[RFC8986] Filsfils, C., Ed., Camarillo, P., Ed., Leddy, J., Voyer,

D., Matsushima, S., and Z. Li, "Segment Routing over IPv6

(SRv6) Network Programming", RFC 8986,

DOI 10.17487/RFC8986, February 2021,

11.2. Informative References

[Abd10] Jalil, K. A., Kamarudin, M. H., and M. N. Masrek, "A

Diagnosis Expert System for Network Traffic Management",

2010. IEEE international conference on networking and

information technology

[Beg19] Bega, D., Gramaglia, M., Fiore, M., Banchs, A., and X.

Costa-Perez, "DeepCog: Cognitive Network Management in

Sliced 5G Networks with Deep Learning", 2019. IEEE

INFOCOM

[Bos13] Bosshart, P., Gibb, G., Kim, H.-S., Varghese, G., McKeown,

N., Izzard, M., Mujica, F., and M. Horowitz, "Forwarding

metamorphosis: Fast programmable match-action processing

in hardware for SDN", 2013. ACM SIGCOMM

[Bos14] Bosshart, P., Daly, D., Gibb, G., Izzard, M., McKeown,

N., Rexford, J., Schlesinger, C., Talayco, D., Vahdat, A.,

Varghese, G., and D. Walker, "P4: programming protocol-

independent packet processors", 2014. SIGCOMM Comput.

Commun. Rev. 44

[Bou18] Boutaba, R., Salahuddin, M. A., Limam, N., Ayoubi, S.,

Shahriar, N., Estrada-Solano, F., and O. M. Caicedo, "A

comprehensive survey on machine learning for networking:

evolution, applications and research opportunities", 2018.

Journal of Internet Services and Applications 9, 16

[Bri19] Brissaud, P.-O., François, J., Chrisment, I., Cholez, T.,

and O. Bettan, "Transparent and Service-Agnostic

Monitoring of Encrypted Web Traffic", 2019. IEEE

Transactions on Network and Service Management, 16 (3)

[Cha18] Chaignon, P., Lazri, K., François, J., Delmas, T., and O.

Festor, "Oko: Extending Open vSwitch with Stateful

Filters", 2018. ACM Symposium on SDN Research (SOSR)

[Che19] Chen, Y., Yen, L., Wang, W., Chuang, C., Liu, Y., and C.

Tseng, "P4-Enabled Bandwidth Management", 2019. Asia-

Pacific Network Operations and Management Symposium

(APNOMS)

[czb20] Clemm, A., Zhani, M. F., and R. Boutaba, "Network

Management 2030: Operations and Control of Network 2030

Services", 2020. Springer Journal of Network and Systems

Management (JNSM)

[Dat18] Datta, R., Choi, S., Chowdhary, A., and Y. Park,

"P4Guard: Designing P4 Based Firewall", 2018. IEEE

Military Communications Conference (MILCOM)

[Dij19] Dijkhuizen, N. V., Ham, J. V. D., and X. Li, "A Survey of

Network Traffic Anonymisation Techniques and

Implementations", 2014. ACM Comput. Surv. 51, 3, Article

[Evr19] Evrard, L., François, J., Colin, J.-N., and F. Beck,

"port2dist: Semantic Port Distances for Network

Analytics", 2019. IFIP/IEEE Symposium on Integrated

Network and Service Management (IM)

[gnn1] Battaglia, P. W., et al., "Relational inductive biases,

deep learning, and graph networks", 2018. arXiv preprint

arXiv:1806.01261

[gnn2] Rusek, K., Suárez-Varela, J., Mestres, A., Barlet-Ros, P.,

and A. Cabellos-Aparicio, "Unveiling the potential of

Graph Neural Networks for network modeling and

optimization in SDN", 2019. ACM Symposium on SDN Research

[Gup18] Gupta, A., Harrison, R., Canini, M., Feamster, N.,

Rexford, J., and W. Willinger, "Sonata: query-driven

streaming network telemetry", 2018. ACM SIGCOMM

Conference

[Hir15] Hirth, M., Hossfeld, T., Mellia, M., Schwartz, C., and F.

Lehrieder, "Crowdsourced network measurements: Benefits

and best practices", 2015. Computer Networks. 90

[Hoo18] Hooft, J. V. D., Claeys, M., Bouten, N., Wauters, T.,

Schönwälder, J., Stiller, A. P. B., Charalambides, M.,

Badonnel, R., Serrat, J., Santos, C. R. P. D., and F. D.

Turck, "Updated Taxonomy for the Network and Service

Management Research Field", 2018. Journal of Network

and Systems Management (JNSM) 26, 790-808

[Hua19] Huang, C., Zhai, S., Talbott, W., Bautista, M. A., Sun,

S.-Y., Guestrin, C., and J. Susskind, "Addressing the

Loss-Metric Mismatch with Adaptive Loss Alignment", 2020.

[Hui22] Hui, S., Wang, H., Wang, Z., Yang, X., Liu, Z., Jin, D.,

and Y. Li, "Knowledge Enhanced GAN for IoT Traffic

Generation", 2022. ACM Web Conference 2022 (WWW)

[Kaf19] Kafle, V. P., Martinez-Julia, P., and T. Miyazawa,

"Automation of 5G Network Slice Control Functions with

Machine Learning", 2019. IEEE Communications Standards

Magazine, vol. 3, no. 3, pp. 54-62

[Kr14] Kreutz, D., Ramos, F. M., Verissimo, P. E., Rothenberg, C.

E., Azodolmolky, S., and S. Uhlig, "Software-defined

networking: A comprehensive survey", 2015. Proceedings of

the IEEE, vol. 103, no. 1, pp. 14-76

[Lee21] Lee, K., Smith, L., and P. Abbeel, "Feedback-efficient

interactive reinforcement learning via relabeling

experience and unsupervised pre-training", 2021. arXiv

preprint arXiv:2106.05091

[Li19] Li, R., Makhijani, K., Yousefi, H., Westphal, C., Dong,

L., Wauters, T., and F. D. Turck, "A Framework for

Qualitative Communications Using Big Packet Protocol",

2019. ACM SIGCOMM Workshop on Networking for Emerging

Applications and Technologies (NEAT)

[Lia18] Liang, S., Yin, S., Liu, L., Luk, W., and S. Wei, "FP-BNN:

Binarized neural network on FPGA", 2018. Neurocomputing,

Volume 275

[Liu16] Liu, Z., Manousis, A., Vorsanger, G., Sekar, V., and V.

Braverman, "One Sketch to Rule Them All: Rethinking

Network Flow Monitoring with UnivMon", 2016. ACM SIGCOMM

Conference

[Lop20] López, J., Labonne, M., Poletti, C., and D. Belabed,

"Priority Flow Admission and Routing in SDN: Exact and

Heuristic Approaches", 2020. IEEE International Symposium

on Network Computing and Applications (NCA)

[Mar18] Martinez-Julia, P., Kafle, V. P., and H. Harai,

"Exploiting External Events for Resource Adaptation in

Virtual Computer and Network Systems", 2018. IEEE

Transactions on Network and Service Management, Vol. 15,

[Mar20] Martinez-Julia, P., Kafle, V. P., and H. Asaeda,

"Explained Intelligent Management Decisions in Virtual

Networks and Network Slices", 2020. Conference on

Innovation in Clouds, Internet and Networks and Workshops

(ICIN)

[Mus18] Musumeci, F., Rottondi, C., Nag, A., Macaluso, I., Zibar,

D., Ruffini, M., and M. Tornatore, "An overview on

application of machine learning techniques in optical

networks", 2018. IEEE Communications Surveys & Tutorials,

21(2), 1383-1408.

[Ngu20] Nguyen, T. G., Phan, T. V., Hoang, D. T., Nguyen, T. N.,

and C. So-In, "Efficient SDN-based traffic monitoring in

IoT networks with double deep Q-network", 2020.

International conference on computational data and social

networks, Springer

[Puj21] Pujol-Perich, D., Suárez-Varela, J., Xiao, S., Wu, B.,

Cabello, A., and P. Barlet-Ros, "NetXplain: Real-time

explainability of Graph Neural Networks applied to

Computer Networks", 2021. MLSys workshop on Graph Neural

Networks and Systems (GNNSys)

[Rex06] Rexford, J., "Route optimization in IP networks", 2006.

Handbook of Optimization in Telecommunications (pp.

679-700), Springer

[Rin17] Ring, M., Dallmann, A., Landes, D., and A. Hotho, "IP2Vec:

Learning Similarities Between IP Addresses", 2017. IEEE

International Conference on Data Mining Workshops (ICDMW)

[Sco11] Coull, S. E., Monrose, F., and M. Bailey, "On Measuring

the Similarity of Network Hosts: Pitfalls, New Metrics,

and Empirical Analyses", 2011. NDSS

[Sen04] Sen, S., Spatscheck, O., and D. Wang, "Accurate, scalable

in-network identification of p2p traffic using application

signatures", 2004. ACM International conference on World

Wide Web (WWW)

[Sol20] Soliman, H. M., Salmon, G., Sovilij, D., and M. Rao, "A

Graph Neural Network Approach for Scalable and Dynamic IP

Similarity in Enterprise Networks", 2020. IEEE

International Conference on Cloud Networking (CloudNet)

[Ste92] Stern, D. and P. Chemouil, "A Diagnosis Expert System for

Network Traffic Management", 1992. Networks, Kobe, Japan

[Tan20] Tangari, G., Charalambides, M., Pavlou, G., Grazian, C.,

and D. Tuncer, "Classification-assisted Query Processing

for Network Telemetry", 2020. Network Traffic Measurement

and Analysis Conference (TMA)

[Tan20b] Lizhuang, T., Wei, S., Zhenyi, Z., Jingying, M., Xiaoxi,

L., and L. Na, "In-band Network Telemetry: A Survey",

2020. Computer Networks. 186. 10.1016

[tl1] Torrey, L. and J. Shavlik, "Transfer learning", 2010.

Handbook of research on machine learning applications and

trends: algorithms, methods, and techniques

[Val17] Valadarsky, A., Schapira, M., Shahaf, D., and A. Tamar, "Learning to route",

2017. ACM HotNets

[XAI] Samek, W., Wiegand, T., and K.-R. Müller, "Explainable

artificial intelligence: Understanding, visualizing and

interpreting deep learning models", 2017. arXiv preprint

arXiv:1708.08296

[Xie18] Xie, J., Yu, F. R., Huang, T., Xie, R., Liu, J., Wang, C.,

and Y. Liu, "A survey of machine learning techniques

applied to software defined networking (SDN): Research

issues and challenges", 2018. IEEE Communications Surveys

& Tutorials

[Xu18] Xu, Z., Tang, J., Meng, J., Zhang, W., Wang, Y., Liu, C. H., and

D. Yang, "Experience-driven networking: A deep reinforcement

learning based approach", 2018. IEEE INFOCOM

[Yan18] Yang, T., Jiang, J., Liu, P., Huang, Q., Gong, J., Zhou,

Y., Miao, R., Li, X., and S. Uhlig, "Elastic sketch:

adaptive and fast network-wide measurements", 2018. ACM

SIGCOMM Conference

[Yan20] Yang, H., Alphones, A., Xiong, Z., Niyato, D., Zhao, J.,

and K. Wu, "Artificial-Intelligence-Enabled Intelligent

6G Networks", 2020. IEEE Network, vol. 34, no. 6, pp.

272-280

[Yu14] Yu, Y., Qian, C., and X. Li, "Distributed and

collaborative traffic monitoring in software defined

networks", 2014. ACM Hot topics in software defined

networking

[Zil20] Meng, Z., Wang, M., Bai, J., Xu, M., Mao, H., and H. Hu,

"Interpreting Deep Learning-Based Networking Systems",

2020. ACM SIGCOMM

Acknowledgments

This document is the result of a collective work. Authors of this

document are the main contributors and the editors but contributions

have been also received from the following people we acknowledge:

Laurent Ciavaglia, Felipe Alencar Lopes, Abdelkader Lahamdi, Albert

Cabellos, Jose Suarez-Varela, Marinos Charalambides, Ramin Sadre,

Pedro Martinez-Julia and Flavio Esposito

This document is also partially supported by project AI@EDGE, funded

from the European Union’s Horizon 2020 H2020-ICT-52 call for

projects, under grant agreement no. 101015922.

Jérôme François

615 rue du jardin botanique

Villers-lès-Nancy

France

Email: jerome.francois@inria.fr

Alexander Clemm

Futurewei Technologies, Inc.

United States of America

Email: alex@clemm.org

Dimitri Papadimitriou

Greece

Email: papadimitriou.dimitri.be@gmail.com

Stenio Fernandes

Central Bank of Canada

Canada

Email: steniofernandes@gmail.com

Stefan Schneider

Digital Railway (DSD) at Deutsche Bahn

Germany

Email: stefanschneider93@googlemail.com

Network Working Group Y-G. Hong
Internet-Draft Daejeon University
Intended status: Informational S-B. Oh
Expires: January 12, 2023 KSA
J-S. Youn
DONG-EUI Univ
S-J. Lee
Korea University/KT
H-K. Kahng
Korea University
July 11, 2022

Considerations of deploying AI services in a distributed approach draft-hong-nmrg-ai-deploy-01

Abstract

As AI technology has matured and begun to be applied in various fields, it is changing from running only on very high-performance servers to also running on small hardware, including microcontrollers, low-performance CPUs and AI chipsets. In this document, we consider how to configure a system, in terms of the AI inference service, to provide AI services in a distributed approach. We also describe the points to be considered in an environment where a client connects to a cloud server and an edge device and requests an AI service.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

Hong, et al. Expires January 12, 2023 [Page 1]

Internet-Draft Deploying AI services July 2022

Copyright Notice

This document is subject to BCP 78 and the IETF Trust’s Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction
2. Procedure to provide AI services
3. Network configuration structure to provide AI services
  3.1. AI inference service on Local machine
  3.2. AI inference service on Cloud server
  3.3. AI inference service on Edge device
  3.4. AI inference service on Cloud server and Edge device
4. Considerations when configuring a system to provide AI services
5. IANA Considerations
6. Security Considerations
7. Acknowledgements
8. Informative References
Authors’ Addresses

1. Introduction

In the Internet of Things (IoT), the amount of data generated by IoT devices has exploded along with the number of IoT devices, due to industrial digitization and the development and dissemination of new devices. Various methods are being tried to process this rapidly growing volume of devices and data effectively. One of them is to provide IoT services at a location close to IoT devices and users, moving away from cloud computing that transmits all data generated by IoT devices to a cloud server [I-D.irtf-t2trg-iot-edge].

IoT services have also started to break away from the traditional method of analyzing the collected IoT data in the cloud and delivering the analyzed results back to IoT objects or devices. In other words, AIoT (Artificial Intelligence of Things) technology, a combination of IoT technology and artificial intelligence (AI) technology, has started to be discussed at international standardization organizations such as ITU-T. AIoT technology, discussed by the ITU-T CG-AIoT group, is defined as a technology that combines AI technology and IoT infrastructure to achieve more efficient IoT operations, improve human-machine interaction, and improve data management and analysis [CG-AIoT].

The first work started by the IETF to apply IoT technology to the Internet was research on a lightweight protocol stack, instead of the existing TCP/IP protocol stack, so that various types of IoT devices, not only traditional Internet terminals, could access the Internet [RFC6574][RFC7452]. These technologies have been developed by the 6LoWPAN, 6lo, 6tisch, and core working groups, the t2trg research group, etc. As AI technology has matured and begun to be applied in various fields, just as IoT technology was mounted on resource-constrained devices and connected to the Internet, AI technology is also changing from running only on very high-performance servers with GPUs installed. It is being developed to run on small hardware, including microcontrollers, low-performance CPUs and AI chipsets. This direction of technology development is called On-device AI or TinyML [tinyML].

In this document, we consider how to configure a system, in terms of the AI inference service, to provide AI services in the IoT environment. In the IoT environment, the technology for collecting sensing data from various sensors and delivering it to the cloud has already been studied by many standardization organizations, including the IETF, and many standards have been developed. Now, once an AI model has been created from the collected data, how to deploy this AI model as a system has become the main research goal. Until now, it has been common to develop AI services that collect data and perform inference on the servers used for training, but for AI services to spread widely, it is not appropriate to rely on expensive servers to provide them. In addition, since the server that collects data and trains models mainly exists in the form of a cloud server, connecting a large number of terminals to these cloud servers to request AI services also causes many problems. Therefore, requesting an AI service from an edge device located nearby, rather than from an AI server located in a distant cloud, can bring benefits such as real-time service support, network traffic reduction, and protection of important data [I-D.irtf-t2trg-iot-edge].

Even if an edge device is used to serve AI services, it is still important to connect to an AI server in the cloud for tasks that take a lot of time or require a lot of data. Therefore, offloading techniques for properly distributing the workload between the cloud server and the edge device are also being actively studied. This document describes, for the network structures presented in the following sections, the points to be considered in an environment where a client connects to a server and an edge device and requests an AI service. The following considerations and options are derived:

o AI inference service execution entity

o Hardware specifications of the machine to perform AI inference services

o Selection of AI models to perform AI inference services

o A method of providing AI services from cloud servers or edge devices

o Communication method to transmit data to request AI inference service

2. Procedure to provide AI services

Research on AI services has been under way for a long time, so AI services can be provided in many different forms. However, due to the nature of AI technology, a system for providing AI services generally consists of the following steps [AI_inference_archtecture][Google_cloud_iot].

+-----------+  +-----------+  +-----------+  +-----------+  +-----------+
| Collect & |  | Analysis &|  |   Train   |  | Deploy &  |  | Monitor & |
|   Store   |->| Preprocess|->|  AI model |->| Inference |->| Maintain  |
|   data    |  |   data    |  |           |  | AI model  |  | Accuracy  |
+-----------+  +-----------+  +-----------+  +-----------+  +-----------+
|<--------->|  |<------------------------>|  |<--------->|  |<--------->|
 Sensor, DB             AI Server               Target       AI Server &
                                                machine    Target machine
|<---------------->|<--------------------->|<-------------->|<--------->|
      Internet               Local              Internet       Local &
                                                              Internet

Figure 1: AI service workflow

o Data collection & Store

o Data Analysis & Preprocess

o AI Model Training

o AI Model Deploy & Inference

o Monitor & Maintain Accuracy

In the data collection step, data required for training is prepared by collecting data from sensors and IoT devices or by using data stored in a database. Equipment involved in this step includes sensors, IoT devices, the servers that store their data, and database servers. Since the operations performed at this step are conducted through the Internet, many of the IoT technologies developed by the IETF so far are suitable for this step.

In the data analysis and pre-processing step, the features of the prepared data are analyzed and pre-processing for training is performed. Equipment involved in this step includes a high-performance server equipped with a GPU and a database server, and this step is mainly performed in the local network.

In the model training step, a trained model is created by applying an algorithm suitable for the characteristics of the data and the problem to be solved. Equipment involved in this step includes a high-performance server equipped with a GPU, and this step is mainly performed on a local network.

In the model deploying and inference service provision step, the problem to be solved (e.g., classification, regression problem) is solved using AI technology. Equipment involved in this step may include a target machine, a client, a cloud, etc. that provide AI services, and since various equipment is involved in this stage, it is conducted through the Internet. This document summarizes the factors to be considered at this step.

In the accuracy monitoring step, if the performance deteriorates due to new data, a new model is created through re-training, and the AI service quality is maintained by using the newly created model. Because re-training and model deployment are performed again, this step involves the same equipment and operations as the model training step and the model deploying and inference service provision step described above.
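The accuracy monitoring step can be illustrated with a small sketch. It assumes that ground-truth labels eventually become available for some requests, and it uses a sliding window and an accuracy threshold to decide when re-training is due. The class name, window size, and threshold are illustrative assumptions and are not defined by this document.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks recent prediction outcomes and flags when re-training is due."""

    def __init__(self, window=100, threshold=0.9):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.threshold = threshold            # assumed minimum acceptable accuracy

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Decide only once the window is full, to avoid reacting to a few samples.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=5, threshold=0.8)
samples = [("cat", "cat"), ("dog", "dog"), ("cat", "dog"),
           ("dog", "cat"), ("cat", "cat")]
for predicted, actual in samples:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```

When `needs_retraining()` returns true, the workflow returns to the model training step with the newly collected data.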

3. Network configuration structure to provide AI services

In general, after training the AI model, the AI model can be built on a local machine for AI model deploying and inference services to provide AI services. Alternatively, we can place AI models on cloud servers or edge devices and make AI service requests remotely. In addition, for overall service performance, some AI service requests can be sent to the cloud server and others to edge devices through appropriate load balancing.

3.1. AI inference service on Local machine

The following figure shows a case where a client module requests AI service from an AI server module running on the same local machine.

+---------------------------------------------------------------------+
|                                                                     |
|  +-----------------+       Request AI       +-----------------+     |
|  |  Client module  |   Inference service    |  Server module  |     |
|  |  for AI service |----------------------->|  for AI service |     |
|  |                 |<-----------------------|                 |     |
|  +-----------------+        Reply AI        +-----------------+     |
|                         Inference result                            |
+---------------------------------------------------------------------+
                             Local machine

Figure 2: AI inference service on Local machine

This method is often used when configuring a system that focuses on training AI models to improve their inference accuracy and performance, without particular regard to AI services or AI model deploying and inference. In this case, since the client module that requests the AI inference service and the AI server module that directly performs it are on the same machine, the communication/network environment and the service provision method need little consideration. This method can also be used when we simply want to set up the AI inference service on one machine, such as an embedded machine or a customized machine, with no plan to change the AI service in the future.

In this case, the high level of hardware performance needed to train the AI model is not required, but the machine must have sufficient hardware performance to run the AI inference service.

3.2. AI inference service on Cloud server

The following figure shows the case where the client module that requests AI service and the AI server module that directly performs AI service run on different machines.

                              +--------------------------------------+
+------------------------+    |     +---------------------------+    |
|  +-----------------+   |    |     |   +-----------------+     |    |
|  |  Client module  |<--+----+-----+-->|  Server module  |     |    |
|  |  for AI service |   |    |     |   |  for AI service |     |    |
|  +-----------------+   |    |     |   +-----------------+     |    |
+------------------------+    |     +---------------------------+    |
     Client machine           |            Server machine            |
                              +--------------------------------------+
                                          Cloud(Internet)

Figure 3: AI inference service on Cloud server

In this case, the client module requesting the AI inference service runs on the client machine, and the AI server module that directly performs the AI inference service runs on a separate server machine, and this server machine is in the cloud network. In this case, the performance of the client machine does not need to be high because the client machine simply needs to request the AI inference service and, if necessary, deliver only the data required for the AI service request. For the AI server module that directly performs AI inference service, we can set up our own AI server, or we can use commercial clouds such as Amazon, Microsoft, and Google.

3.3. AI inference service on Edge device

The following figure shows the case where the client module that requests AI service and the AI server module that directly performs AI service are separated, and the AI server module is located in the edge device.

                              +--------------------------------------+
+------------------------+    |     +---------------------------+    |
|  +-----------------+   |    |     |   +-----------------+     |    |
|  |  Client module  |<--+----+-----+-->|  Server module  |     |    |
|  |  for AI service |   |    |     |   |  for AI service |     |    |
|  +-----------------+   |    |     |   +-----------------+     |    |
+------------------------+    |     +---------------------------+    |
     Client machine           |             Edge device              |
                              +--------------------------------------+
                                            Edge network

Figure 4: AI inference service on Edge device

In this case too, the client module that requests the AI inference service runs on the client machine, the AI server module that directly performs the AI inference service runs on the edge device, and the edge device is in the edge network. For the AI server module that directly performs the AI inference service on the edge device, we can configure the edge device ourselves or use a commercial edge computing module.

The difference from the above case where the AI server module is in the cloud is that the edge device is usually close to the client but has lower performance than a server in the cloud. There are therefore advantages in data transfer time and inference time, but the inference service performance per unit time is lower.

3.4. AI inference service on Cloud server and Edge device

The following figure shows the case where AI server modules that directly perform AI services are distributed in the cloud and edge devices.

                              +--------------------------------------+
+------------------------+    |     +---------------------------+    |
|  +-----------------+   |    |     |   +-----------------+     |    |
|  |  Client module  |<--+--+-+-----+-->|  Server module  |     |    |
|  |  for AI service |<--+--+ |     |   |  for AI service |     |    |
|  +-----------------+   |  | |     |   +-----------------+     |    |
+------------------------+  | |     +---------------------------+    |
     Client machine         | |             Edge device              |
                            | +--------------------------------------+
                            |               Edge network
                            | +--------------------------------------+
                            | |     +---------------------------+    |
                            | |     |   +-----------------+     |    |
                            +-+-----+-->|  Server module  |     |    |
                              |     |   |  for AI service |     |    |
                              |     |   +-----------------+     |    |
                              |     +---------------------------+    |
                              |            Server machine            |
                              +--------------------------------------+
                                          Cloud(Internet)

Figure 5: AI inference service on Cloud server and Edge device

The AI server module running in the cloud and the AI server module running on the edge device differ in AI inference service performance. Therefore, a client may distribute its AI inference service requests between the cloud and the edge device as appropriate for the desired AI service. For example, for an AI service that tolerates lower inference accuracy but needs a short inference time, the client can send the AI inference request to the edge device.
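The distribution of requests between the cloud and the edge device can be sketched as a simple selection rule on the client side. The sketch below assumes the client already knows an expected accuracy and response time for each target; the `Target` type, the function name, and all numbers are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    accuracy: float      # expected inference accuracy (0..1), assumed known
    response_ms: float   # expected transfer + inference time, assumed known


def choose_target(targets, min_accuracy, max_response_ms):
    """Return the fastest target that meets both requirements, or None."""
    eligible = [t for t in targets
                if t.accuracy >= min_accuracy
                and t.response_ms <= max_response_ms]
    return min(eligible, key=lambda t: t.response_ms, default=None)


targets = [
    Target("edge", accuracy=0.85, response_ms=30.0),    # illustrative values
    Target("cloud", accuracy=0.95, response_ms=120.0),  # illustrative values
]

# A delay-sensitive request that tolerates lower accuracy goes to the edge.
fast = choose_target(targets, min_accuracy=0.80, max_response_ms=50.0)
# A request that needs high accuracy and tolerates delay goes to the cloud.
accurate = choose_target(targets, min_accuracy=0.90, max_response_ms=500.0)
print(fast.name, accurate.name)
```

A real client would likely refresh the per-target estimates from measurements rather than use fixed values.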

4. Considerations when configuring a system to provide AI services

As described in the previous section, the AI server module that directly performs AI inference services by utilizing AI models can run on a local machine, a cloud server, or an edge device. In theory, if the AI inference service is performed on a local machine, the AI service can be provided without communication delay or packet loss, but a certain amount of hardware performance is required to perform AI inference. In a future environment where AI services are widely deployed, the cost of the machines that perform AI services becomes important, so this case is not expected to be common. If so, whether the AI inference service is performed on the cloud server or on the edge device can be a determining factor in the system configuration.

When an AI inference service request is made to a distant cloud server, transmission may take a long time, but the server can handle many AI inference service requests in a short time, and the inference accuracy is higher. Conversely, when an AI service request is made to a nearby edge device, the transmission time is short, but the device cannot handle many AI inference service requests at once, and the inference accuracy is lower. Therefore, the characteristics and requirements of the AI service to be performed must be analyzed to determine whether to perform the AI inference service on a local machine, a cloud server, or an edge device.

The hardware characteristics required of the machine performing the AI service vary according to the characteristics of the AI service, the characteristics of the data used for training, and the problem to be solved. In general, machines on cloud servers are viewed as machines with higher performance than edge devices. However, the performance of the AI inference service varies depending on how hardware such as CPU, RAM, GPU, and network interface is configured for each cloud server and edge device. Leaving cost aside, it would be best to configure the system with the machine that has the best hardware performance, but in reality cost must always be considered when configuring the system. So the required performance of the local machine, cloud server, and edge device must be determined according to the characteristics and requirements of the AI service to be performed.
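The trade-off between transmission time and inference time can be made concrete with a rough estimate. The sketch below assumes a simple additive model (round-trip time plus upload time plus inference time) and ignores queueing and the size of the reply; the function name and all numeric values are illustrative assumptions, not measurements.

```python
def response_time_ms(payload_bytes, bandwidth_mbps, rtt_ms, inference_ms):
    """Rough end-to-end time: round trip + upload time + inference time."""
    transfer_ms = payload_bytes * 8 / (bandwidth_mbps * 1_000_000) * 1000
    return rtt_ms + transfer_ms + inference_ms


image_bytes = 500_000  # hypothetical input size

# Distant cloud server: long round trip, fast inference (assumed values).
cloud_ms = response_time_ms(image_bytes, bandwidth_mbps=50,
                            rtt_ms=80, inference_ms=10)
# Nearby edge device: short round trip, slower inference (assumed values).
edge_ms = response_time_ms(image_bytes, bandwidth_mbps=100,
                           rtt_ms=5, inference_ms=40)

print(round(cloud_ms), round(edge_ms))
```

With these assumed values the edge device answers a single request sooner, even though its inference step is slower; with a larger model or a heavier request load the comparison can reverse.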

Although not directly related to communication/network, the biggest influence on AI inference services is the AI model to be used for AI inference service. For example, in AI services such as image classification, there are various types of AI models such as ResNet, EfficientNet, VGG, and Inception. These AI models differ in AI inference accuracy, but also in AI model file size and AI inference time. AI models with the highest inference accuracy typically have very large file sizes and take a lot of AI inference time. So, when constructing an AI service system, it is not always good to choose an AI model with the highest AI inference accuracy. Again, it is important to select an AI model according to the characteristics and requirements of the AI service to be performed.
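Selecting a model under such constraints can be sketched as follows. The candidate names and their accuracy, file size, and inference time values are placeholders invented for illustration; they are not measurements of ResNet, EfficientNet, VGG, Inception, or any other real model.

```python
# (name, accuracy, file_size_mb, inference_ms); illustrative values only
candidates = [
    ("large_model", 0.84, 500.0, 200.0),
    ("medium_model", 0.80, 100.0, 60.0),
    ("small_model", 0.72, 15.0, 12.0),
]


def select_model(candidates, max_size_mb, max_inference_ms):
    """Pick the most accurate model that fits the size and time budget."""
    fitting = [c for c in candidates
               if c[2] <= max_size_mb and c[3] <= max_inference_ms]
    return max(fitting, key=lambda c: c[1], default=None)


# A cloud server with a generous budget can take the most accurate model.
cloud_choice = select_model(candidates, max_size_mb=1000, max_inference_ms=1000)
# An edge device with tight limits ends up with a smaller, faster model.
edge_choice = select_model(candidates, max_size_mb=20, max_inference_ms=20)
print(cloud_choice[0], edge_choice[0])
```

The point of the sketch is that the highest-accuracy model is chosen only when the deployment target can afford its size and inference time.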

Experimentally, it is recommended to use an AI model with high AI inference accuracy on the cloud server, and to use an AI model that provides fast AI inference, even though its inference accuracy is slightly lower, on the edge device.

Although it is partly an implementation issue, we should also consider how AI services are delivered on cloud servers or edge devices. With current technology, either a traditional web server or a server specialized for AI inference (e.g., Google’s TensorFlow Serving) can be used. Traditional web frameworks such as Flask and Django have the advantage of running on various types of machines, but since they are designed to support general web services, their service execution time is not fast. TensorFlow Serving uses the features of TensorFlow to make AI inference services very fast and efficient. However, it cannot be used on older CPUs that do not support AVX, because TensorFlow does not run on them. Therefore, rather than unconditionally using a server specialized for AI inference, the AI server module method should be chosen in consideration of the hardware characteristics of the AI system that can be built.
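The traditional web server method can be illustrated with a minimal sketch of a server module and a client module. To stay self-contained it uses only the Python standard library as a stand-in for a web framework such as Flask, and the `predict` function is a stub standing in for a trained AI model; the `/predict` path and the JSON field names are assumptions made for this example.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stub standing in for a trained AI model (hypothetical)."""
    return {"label": "positive" if sum(features) > 0 else "negative"}


class InferenceHandler(BaseHTTPRequestHandler):
    """Server module: reads a JSON request and replies with the result."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        body = json.dumps(predict(request["features"])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def request_inference(host, port, features):
    """Client module: sends the input data and returns the inference result."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.request("POST", "/predict", body=json.dumps({"features": features}),
                 headers={"Content-Type": "application/json"})
    result = json.loads(conn.getresponse().read())
    conn.close()
    return result


server = HTTPServer(("127.0.0.1", 0), InferenceHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
result = request_inference("127.0.0.1", server.server_port, [0.5, 1.5, -0.25])
server.shutdown()
server.server_close()
print(result)
```

The same client code works whether the server module runs on the local machine, a cloud server, or an edge device; only the host and port change.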

The communication method for transferring data to request an AI inference service is also an important decision in constructing an AI system. The traditional REST method can be used with a wide variety of machines and services, but its performance is inferior to that of Google’s gRPC. gRPC offers many advantages for AI inference services because it supports large-capacity and more efficient data transfer than REST.
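One reason for the efficiency difference is the encoding of the payload: REST services commonly exchange JSON text, whereas gRPC uses the binary Protocol Buffers encoding. The sketch below does not use gRPC or Protocol Buffers; it only compares a JSON encoding with a hand-rolled binary layout (a 32-bit count followed by 32-bit floats) to show the size effect. The layout and the sample values are assumptions made for this illustration.

```python
import json
import struct

# Stand-in for an input tensor: 1000 values exactly representable as float32.
values = [0.125 * i for i in range(1000)]

# Text encoding, as typically sent in a REST/JSON request body.
json_payload = json.dumps({"inputs": values}).encode("utf-8")

# Binary encoding: little-endian element count, then 32-bit floats.
binary_payload = struct.pack(f"<I{len(values)}f", len(values), *values)

# Decoding the binary payload recovers the original values.
(count,) = struct.unpack_from("<I", binary_payload, 0)
decoded = list(struct.unpack_from(f"<{count}f", binary_payload, 4))

print(len(json_payload), len(binary_payload))
```

Payload size is only one factor; transport features such as connection reuse and streaming also affect the comparison and are not covered by this sketch.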

Cloud-edge collaboration-based AI service development is actively underway. In particular, in the case of AI services that are sensitive to network delays, such as object recognition and autonomous vehicle services, (micro)services for inference are placed on edge devices to obtain fast inference results and provide services. As such, in the development of intelligent IoT services, various devices that can provide computing services within the network, such as edge devices, are being added as network elements, and the number of IoT devices using them is rapidly increasing. Therefore, a new function for computing resource management and operation is required in terms of providing computing services within the network.

5. IANA Considerations

There are no IANA considerations related to this document.

6. Security Considerations

When an AI service is performed on a local machine, no network-related security issue arises, but when an AI service is provided through a cloud server or edge device, its IP address and port number may become known to outside parties, who can then attack it. Therefore, when providing AI services by utilizing machines on the network such as cloud servers and edge devices, it is necessary to analyze the characteristics of the modules to be used, identify their security vulnerabilities, and take countermeasures.

7. Acknowledgements

8. Informative References

[RFC6574] Tschofenig, H. and J. Arkko, "Report from the Smart Object Workshop", RFC 6574, DOI 10.17487/RFC6574, April 2012, <https://www.rfc-editor.org/info/rfc6574>.

[RFC7452] Tschofenig, H., Arkko, J., Thaler, D., and D. McPherson, "Architectural Considerations in Smart Object Networking", RFC 7452, DOI 10.17487/RFC7452, March 2015, <https://www.rfc-editor.org/info/rfc7452>.

[I-D.irtf-t2trg-iot-edge] Hong, J., Hong, Y., de Foy, X., Kovatsch, M., Schooler, E., and D. Kutscher, "IoT Edge Challenges and Functions", draft-irtf-t2trg-iot-edge-07 (work in progress), June 2022.

[CG-AIoT] "ITU-T CG-AIoT", <https://www.itu.int/en/ITU-T/studygroups/2017-2020/20/Pages/ifa-structure.aspx>.

[tinyML] "tinyML Foundation", <https://www.tinyml.org/>.

[AI_inference_archtecture] "IBM Systems, AI Infrastructure Reference Architecture", <https://www.ibm.com/downloads/cas/W1JQBNJV>.

[Google_cloud_iot] "Bringing intelligence to the edge with Cloud IoT", <https://cloud.google.com/blog/products/gcp/bringing-intelligence-edge-cloud-iot>.

Authors’ Addresses

Yong-Geun Hong
Daejeon University
62 Daehak-ro, Dong-gu
Daejeon 34520
Korea

Phone: +82 42 280 4841
Email: yonggeun.hong@gmail.com

SeokBeom Oh
KSA
Digital Transformation Center, 5 Teheran-ro 69-gil, Gangnamgu
Seoul 06160
Korea

Phone: +82 2 1670 6009
Email: isb6655@korea.ac.kr

Joo-Sang Youn
DONG-EUI University
176 Eomgwangno Busan_jin_gu
Busan 614-714
Korea

Phone: +82 51 890 1993
Email: joosang.youn@gmail.com

SooJeong Lee
Korea University/KT
2511 Sejong-ro
Sejong City 30019
Korea

Email: ngenius@korea.ac.kr

Hyun-Kook Kahng
Korea University
2511 Sejong-ro
Sejong City 30019
Korea

Email: kahng@korea.ac.kr

Internet Research Task Force C. Zhou
Internet-Draft H. Yang
Intended status: Informational X. Duan
Expires: 12 January 2023 China Mobile
D. Lopez
A. Pastor
Telefonica I+D
Q. Wu
Huawei
M. Boucadair
C. Jacquenet
Orange
11 July 2022

Digital Twin Network: Concepts and Reference Architecture draft-irtf-nmrg-network-digital-twin-arch-01

Abstract

Digital Twin technology has been seen as a rapid adoption technology in Industry 4.0. The application of Digital Twin technology in the networking field is meant to develop various rich network applications and realize efficient and cost effective data driven network management and accelerate network innovation.

This document presents an overview of the concepts of Digital Twin Network, provides the basic definitions and a reference architecture, lists a set of application scenarios, and discusses the benefits and key challenges of such technology.

Status of This Memo

Zhou, et al. Expires 12 January 2023 [Page 1]

Internet-Draft Digital Twin Network Concept July 2022

Copyright Notice

This document is subject to BCP 78 and the IETF Trust’s Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.

Table of Contents

1. Introduction
2. Terminology
  2.1. Acronyms & Abbreviations
  2.2. Definitions
3. Introduction and Concepts of Digital Twin Network
  3.1. Background of Digital Twin
  3.2. Digital Twin for Networks
  3.3. Definition of Digital Twin Network
4. Benefits of Digital Twin Network
  4.1. Optimized Network Total Cost of Operation
  4.2. Optimized Decision Making
  4.3. Safer Assessment of Innovative Network Capabilities
  4.4. Privacy and Regulatory Compliance
  4.5. Customized Network Operation Training
5. Challenges to Build Digital Twin Network
6. A Reference Architecture of Digital Twin Network
7. Enabling Technologies to Build Digital Twin Network
  7.1. Data Collection and Data Services
  7.2. Network Modeling
  7.3. Network Visualization
  7.4. Interfaces
8. Interaction with IBN
9. Sample Application Scenarios
  9.1. Human Training
  9.2. Machine Learning Training
  9.3. DevOps-Oriented Certification
  9.4. Network Fuzzing
10. Research Perspectives: A Summary
11. Security Considerations
12. Acknowledgements
13. IANA Considerations
14. Open issues
15. References
  15.1. Normative References
  15.2. Informative References
Appendix A. Change Logs
Authors’ Addresses

1. Introduction

The fast growth of network scale and the increased demand placed on these networks require them to accommodate and adapt dynamically to customer needs, implying a significant challenge to network operators. Indeed, network operation and maintenance are becoming more complex due to higher complexity of the managed networks and the sophisticated services they are delivering. As such, providing innovations on network technologies, management and operation will be more and more challenging due to the high risk of interfering with existing services and the higher trial costs if no reliable emulation platforms are available.

A Digital Twin is the real-time representation of a physical entity in the digital world. It has the characteristics of virtual-reality interrelation and real-time interaction, iterative operation and process optimization, full life-cycle and comprehensive data-driven network infrastructure. Currently, digital twin has been widely acknowledged in academic publications. See more in Section 3.

A digital twin for networks platform can be built by applying Digital Twin technologies to networks and creating a virtual image of physical network facilities (called herein, emulation). Basically, the digital twin for networks is an expansion platform of network simulation. The main difference compared to traditional network management systems is the interactive virtual-real mapping and data driven approach to build closed-loop network automation. Therefore, a digital twin network platform is more than an emulation platform or network simulator.

Through the real-time data interaction between the physical network and its twin network(s), the digital twin network platform might help network designers to achieve simpler, more automated, more resilient, full life-cycle operation and maintenance. More specifically, the digital twin network can be used to develop various rich network applications and assess specific behaviors (including network transformation) before actual implementation in the physical network, to tweak the network for better optimized behavior, and to run ’what-if’ scenarios that cannot be tested and evaluated easily in the physical network. In addition, service impact analysis tasks can also be facilitated.

2. Terminology

2.1. Acronyms & Abbreviations

IBN: Intent-Based Networking

AI: Artificial Intelligence

CI/CD: Continuous Integration / Continuous Delivery

ML: Machine Learning

OAM: Operations, Administration, and Maintenance

PLM: Product Lifecycle Management

2.2. Definitions

This document makes use of the following terms:

Digital Twin: a virtual instance of a physical system (twin) that is continually updated with the latter’s performance, maintenance, and health status data throughout the physical system’s life cycle.

Digital twin network: a digital twin that is used in the context of networking. This is also called a digital twin for networks. See more in Section 3.3.

3. Introduction and Concepts of Digital Twin Network

3.1. Background of Digital Twin

The concept of the "twin" dates to the National Aeronautics and Space Administration (NASA) Apollo program in the 1970s, where a replica of space vehicles on Earth was built to mirror the condition of the equipment during the mission [Rosen2015].

In 2003, Digital Twin was attributed to John Vickers by Michael Grieves in his product lifecycle management (PLM) course as "virtual digital representation equivalent to physical products" [Grieves2014]. Digital twin can be defined as a virtual instance of a physical system (twin) that is continually updated with the latter’s performance, maintenance, and health status data throughout the physical system’s life cycle [Madni2019]. By providing a living copy of physical system, digital twins bring numerous advantages, such as accelerated business processes, enhanced productivity, and faster innovation with reduced costs. So far, digital twin has been successfully applied in the fields of intelligent manufacturing, smart city, or complex system operation and maintenance to help with not only object design and testing, but also management aspects [Tao2019].

Compared with ’digital model’ and ’digital shadow’, the key difference of ’digital twin’ is the direction of data between the physical and virtual systems [Fuller2020]. Typically, when using a digital twin, the (twin) system is generated and then synchronized using data flows in both directions between physical and digital components, so that control data can be sent, and changes between the physical and digital objects of systems are automatically represented. This behavior is unlike a ’digital model’ or ’digital shadow’, which are usually synchronized manually, lack control data, and might not have a full cycle of data integrated.
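The distinction by data-flow direction can be sketched in a few lines of Python. This is only an illustration under an assumed reading of the taxonomy: a ’digital model’ has manual flows in both directions, a ’digital shadow’ has an automatic flow from the physical to the digital side only, and a ’digital twin’ has automatic flows in both directions. The names Flow and classify are hypothetical and do not come from this document.

```python
from enum import Enum


class Flow(Enum):
    """How data moves in one direction between the two systems."""
    MANUAL = "manual"
    AUTOMATIC = "automatic"


def classify(physical_to_digital: Flow, digital_to_physical: Flow) -> str:
    """Classify a virtual representation by the direction of automatic data flow.

    Assumed reading: only a twin sends control data back automatically.
    """
    if physical_to_digital is Flow.AUTOMATIC:
        if digital_to_physical is Flow.AUTOMATIC:
            return "digital twin"
        return "digital shadow"
    if digital_to_physical is Flow.MANUAL:
        return "digital model"
    # Automatic control without automatic state collection is not covered
    # by the three terms above, so it is left unclassified here.
    return "unclassified"
```

For example, classify(Flow.AUTOMATIC, Flow.MANUAL) yields "digital shadow": state is collected automatically, but no control data flows back.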

At present (2022), there is no unified definition of digital twin framework. The industry, scientific research institutions, and standards developing organizations are trying to define a general or domain-specific framework of digital twin. [Natis-Gartner2017] proposed that building a digital twin of a physical entity requires four key elements: model, data, monitoring, and uniqueness. [Tao2019] proposed a five-dimensional framework of digital twin {PE, VE, SS, DD, CN}, in which PE represents physical entity, VE represents virtual entity, SS represents service, DD represents twin data, and CN represents the connection between various components. [ISO-2021] issued a draft standard for digital twin manufacturing system, and proposed a reference framework including data collection domain, device control domain, digital twin domain, and user domain.

3.2. Digital Twin for Networks

Communication networks can provide a solid foundation for implementing various ’digital twin’ applications. At the same time, in the face of increasing business types, scale and complexity, a network itself also needs to use digital twin technology to seek better solutions beyond physical network. Since 2017, the application of digital twin technology in the field of communication networks has gradually been researched. Some examples are listed below.

In academia, [Dong2019] established the digital twin of a 5G mobile edge computing (MEC) network, used the twin offline to train the resource allocation optimization and normalized energy-saving algorithm based on reinforcement learning, and then updated the scheme to the MEC network. [Dai2020] established a digital twin edge network for a mobile edge computing system, in which a twin edge server is used to evaluate the state of the entity server, and the twin mobile edge computing system provides data for training the offloading strategy. [Nguyen2021] discusses how to deploy a digital twin for complex 5G networks. [Hong2021] presents a digital twin platform towards automatic and intelligent management for data center networks, and then proposes simplified workflows for network service management. In addition, international workshops dedicated to digital twin in the network field have already appeared, such as IEEE DTPI 2021 - Digital Twin Network Online Session [DTPI2021], and IEEE NOMS 2022 - TNT workshop [TNT2022].

Although the application of digital twin technology in networking has started, the research of digital twin for networks technology is still in its infancy. Current applications focus on specific scenarios (such as network optimization), where network digital twin is just used as a network simulation tool to solve the problem of network operation and maintenance. Combined with the characteristics of digital twin technology and its application in other industries, this document believes that digital twin network can be regarded as an indispensable part of the overall network system and provides a general architecture involving the whole life cycle of physical network in the future, serving the application of network innovative technologies such as network planning, construction, maintenance and optimization, improving the automation and intelligence level of the network.

3.3. Definition of Digital Twin Network

So far, there is no standard definition of "digital twin network" within the networking industry. This document defines "digital twin network" as a virtual representation of the physical network. Such virtual representation of the network is meant to be used to analyze, diagnose, emulate, and then control the physical network based on data, models, and interfaces. To that aim, a real-time and interactive mapping is required between the physical network and its virtual twin network.

Referring to the characteristics of digital twin in other industries and the characteristics of networking itself, the digital twin network should involve four key elements: data, mapping, models, and interfaces, as shown in Figure 1.

  +-------------+                 +--------------+
  |             |                 |              |
  |   Mapping   |                 |  Interface   |
  |             |                 |              |
  +-------------+-----------------+--------------+
        |                                  |
        |        Analyze, Diagnose         |
        |                                  |
        |     +----------------------+     |
        |     | Digital Twin Network |     |
        |     +----------------------+     |
+------------+                        +------------+
|            |    Emulate, Control    |            |
|   Models   |                        |    Data    |
|            |------------------------|            |
+------------+                        +------------+

Figure 1: Key Elements of Digital Twin Network

Data: A digital twin network should maintain historical data and/or real time data (configuration data, operational state data, topology data, trace data, metric data, process data, etc.) about its real-world twin (i.e. physical network) that are required by the models to represent and understand the states and behaviors of the real-world twin.

The data is characterized as the single source of "truth" and populated in the data repository, which provides timely and accurate data service support for building various models.

Models: Techniques that involve collecting data from one or more sources in the real-world twin and developing a comprehensive representation of the data (e.g., system, entity, process) using specific models. These models are used as emulation and diagnosis basis to provide dynamics and elements on how the live physical network operates and generates reasoning data utilized for decision-making.

Various models such as service models, data models, dataset models, or knowledge graph can be used to represent the physical network assets and, then, instantiated to serve various network applications.

Interfaces: Standardized interfaces can ensure the interoperability of digital twin network. There are two major types of interfaces:

* The interface between the digital twin network platform and the physical network infrastructure.

* The interface between digital twin network platform and applications.

The former provides real-time data collection and control on the physical network. The latter helps in delivering application requests to the digital twin network platform and exposing the various platform capabilities to applications.

Mapping: Used to identify the digital twin and the underlying entities and establish a real-time interactive relation between the physical network and the twin network or between two twin networks. The mapping can be:

* One to one (pairing, vertical): Synchronize between a physical network and its virtual twin network with continuous flows.

* One to many (coupling, horizontal): Synchronize among virtual twin networks with occasional data exchange.

Such mappings provide a good visibility of actual status, making the digital twin suitable to analyze and understand what is going on in the physical network. It also allows using the digital twin to optimize the performance and maintenance of the physical network.

The digital twin network constructed based on the four core technology elements can analyze, diagnose, emulate, and control the physical network in its whole life cycle with the help of optimization algorithms, management methods, and expert knowledge. One of the objectives of such control is to master the digital twin network environment and its elements to derive the required system behavior, e.g., provide:

* repeatability: that is the capacity to replicate network conditions on-demand.

* reproducibility: i.e., the ability to replay successions of events, possibly under controlled variations.
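As an illustration of these two properties, the following Python sketch replays a recorded succession of link events against a twin state. Replaying with no jitter replicates the same conditions on demand (repeatability), while a seeded jitter gives a controlled variation that can itself be replayed (reproducibility). The Event fields, the jitter parameter, and the replay function are hypothetical names chosen for this sketch.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float      # event time in seconds (assumed unit)
    link: str     # link identifier
    up: bool      # link state after the event


def replay(events, jitter=0.0, seed=0):
    """Replay events in time order and return (final_state, trace).

    A fixed seed makes the time perturbation deterministic, so the same
    controlled variation can be replayed as often as required.
    """
    rng = random.Random(seed)
    perturbed = [
        Event(e.t + rng.uniform(-jitter, jitter), e.link, e.up) for e in events
    ]
    state, trace = {}, []
    for e in sorted(perturbed, key=lambda ev: ev.t):
        state[e.link] = e.up
        trace.append((e.link, e.up))
    return state, trace


events = [
    Event(2.0, "A-B", False),
    Event(1.0, "A-B", True),
    Event(3.0, "B-C", True),
]
state, trace = replay(events)
```

Calling replay(events, jitter=0.5, seed=7) twice returns identical results, which is what allows a variation to be studied repeatedly.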

Note: Real-time interaction is not always mandatory for all twins. When testing some configuration changes or trying some innovative techniques, the digital twins can behave as a simulation platform without the need of real time telemetry data. And even in this scenario, it is better to have interactive mapping capability so that the validated changes can be tested in real network whenever required by the testers. In most other cases (e.g., network optimization, network fault recovery), real-time interaction between virtual and real network is mandatory. This way, digital twin network can help achieve the goal of autonomous network or self-driven network.

4. Benefits of Digital Twin Network

Digital twin network can help enable closed-loop network management across the entire lifecycle, from deployment and emulation, to visualized assessment, physical deployment, and continuous verification. By doing so, network operators (and, to some extent, end-users, as allowed by specific application interfaces) can maintain a global, systemic, and consistent view of the network. Also, network operators and/or enterprise users can safely exercise the enforcement of network planning policies, deployment procedures, etc., without jeopardizing the daily operation of the physical network.

The main difference between digital twin network and simulation platform is the use of interactive virtual-real mapping to build closed-loop network automation. Simulation platforms are the predecessors of the digital twin network. One example of such a simulation platform is the network simulator [NS-3], which can be seen as a variant of digital twin network but with low fidelity and lacking interactive interfaces to the real network. Compared with those classical approaches, key benefits of digital twin network can be summarized as follows:

1) Because high-fidelity twins are established from real-time data, network simulation is more effective, and the simulation cost is relatively low.

2) The impact and risk on running networks is low when automatically applying configuration/policy changes after the full analysis and required verifications (e.g., service impact analysis) within the twin network.

3) The faults of the physical network can be automatically captured by analyzing real-time data, then the correction strategy can be distributed to the physical network elements after conducting adequate analysis within the twins to complete the closed-loop automatic fault repair.
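The second and third benefits both rely on the same pattern: a candidate change is tried on the twin, verified there, and pushed to the physical network only if verification passes. The Python sketch below illustrates that pattern under strong simplifying assumptions: the network is a set of undirected links, the change removes links, and the only verification is a reachability check between required node pairs. The function names and the data representation are hypothetical.

```python
def _norm(link):
    """Store an undirected link in a canonical (sorted) order."""
    return tuple(sorted(link))


def reachable(links, src, dst):
    """Return True if dst can be reached from src over undirected links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, stack = {src}, [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def apply_if_verified(physical_links, links_to_remove, must_reach):
    """Try the change on a twin copy; keep the current state if it fails."""
    current = {_norm(link) for link in physical_links}
    twin = current - {_norm(link) for link in links_to_remove}
    verified = all(reachable(twin, s, d) for s, d in must_reach)
    return (twin if verified else current), verified


links = [("A", "B"), ("B", "C"), ("A", "C")]
```

With these links, removing the A-B link still leaves A and B connected through C, so the change is accepted. Removing both links at A would isolate it, so the change is rejected and the current link set is returned unchanged.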

The following subsections elaborate on these benefits in detail.

4.1. Optimized Network Total Cost of Operation

Large scale networks are complex to operate. Since there is no effective platform for simulation, network optimization designs have to be tested on the physical network at the cost of jeopardizing its daily operation and possibly degrading the quality of the services supported by the network. Such assessment greatly increases network operator’s Operational Expenditure (OPEX) budgets too.

With a digital twin network platform, network operators can safely emulate candidate optimization solutions before deploying them in the physical network. In addition, operator’s OPEX on the real physical network deployment will be greatly decreased accordingly at the cost of the complexity of the assessment and the resources involved.

4.2. Optimized Decision Making

Traditional network operation and management mainly focus on deploying and managing running services, but hardly support predictive maintenance techniques.

Digital twin network can combine data acquisition, big data processing, and AI modeling not only to assess the status of the network, but also to predict future trends and better organize predictive maintenance. The ability to reproduce network behaviors under various conditions facilitates the corresponding assessment of the various evolution options as often as required.

4.3. Safer Assessment of Innovative Network Capabilities

Testing a new feature in an operational network is not only complex, but also extremely risky. Service impact analysis is required to be adequately achieved prior to effective activation of a new feature.

Digital twin network can greatly help in assessing innovative network capabilities without jeopardizing the daily operation of the physical network. In addition, it helps researchers to explore network innovation (e.g., new network protocols, network AI/ML applications) efficiently, and network operators to deploy new technologies quickly with lower risks. Take AI/ML applications as an example: there is a conflict between the continuous high reliability requirement (i.e., 99.999%) and the slow learning speed or phase-in learning steps of AI/ML algorithms. With digital twin network, AI/ML can complete the learning and training with sufficient data before deploying the model in the real network. This would encourage more network AI innovations in future networks.

4.4. Privacy and Regulatory Compliance

The requirements on data confidentiality and privacy on network providers increase the complexity of network management, as decisions made by computation logics such as an SDN controller may rely upon the packet payloads. As a result, the improvement of data-driven management requires complementary techniques that can provide a strict control based upon security mechanisms to guarantee data privacy protection and regulatory compliance. This may range from flow identification (using the archetypal five-tuple of addresses, ports and protocol) to techniques requiring some degree of payload inspection, all of them considered suitable to be associated to an individual person, and hence requiring strong protection and/or data anonymization mechanisms.

With strong modeling capability provided by the digital twin network, very limited real data (if at all) will be needed to achieve similar or even higher level of data-driven intelligent analysis. This way, a lower demand of sensitive data will permit to satisfy privacy requirements and simplify the use of privacy-preserving techniques for data-driven operation.
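As a small illustration of the kind of protection mentioned above for flow identifiers, the Python sketch below replaces a five-tuple with a keyed hash before it is stored in a twin. This is an assumption-laden example, not a mechanism defined by this document: the function name, the field order, and the truncation to 16 hexadecimal characters are all hypothetical, and keyed hashing gives pseudonymization rather than full anonymization.

```python
import hashlib
import hmac


def pseudonymize_flow(five_tuple, key: bytes) -> str:
    """Map (src, dst, sport, dport, proto) to a stable pseudonym.

    The same flow and key always give the same pseudonym, so per-flow
    statistics remain usable without storing the addresses themselves.
    """
    src, dst, sport, dport, proto = five_tuple
    message = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]


flow = ("192.0.2.1", "198.51.100.7", 49152, 443, "tcp")
flow_id = pseudonymize_flow(flow, key=b"example-key")
```

Rotating the key changes every pseudonym, which limits how long records can be linked to each other.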

4.5. Customized Network Operation Training

Network architectures can be complex, and their operation requires expert personnel. Digital twin network offers an opportunity to train staff for customized networks and specific user needs. Two salient examples are the application of new network architectures and protocols or the use of "cyber-ranges" to train security experts in threat detection and mitigation.

5. Challenges to Build Digital Twin Network

According to [Hu2021], the main challenges in building and maintaining digital twins can be summarized as the following five aspects:

* Data acquisition and processing

* High-fidelity modeling

* Real-time, two-way communication between the virtual and the real twins

* Unified development platform and tools

* Environmental coupling technologies

Compared with other industrial fields, digital twin in the networking field has its unique characteristics. On one hand, network elements and systems have a higher level of digitalization, which implies that data acquisition and virtual-real communication are relatively easy to achieve. On the other hand, there are many different types of network elements and topologies in the network field; and the network size is characterized by the numbers of nodes and links in it, but the network size growth pace cannot meet the service needs, especially in the deployment of end-to-end services that span multiple administrative domains. So, the construction of a digital twin network system needs to consider the following major challenges:

Large scale challenge: A digital twin of large-scale networks will significantly increase the complexity of data acquisition and storage, and of the design and implementation of relevant models. The requirements on the software and hardware of the digital twin network system will be even more constraining. Therefore, efficient and low-cost tools in various fields are required. Take data as an example: massive network data can help achieve more accurate models. However, the cost of virtual-real communication and data storage becomes extremely high, especially in the multi-domain data-driven network management case. Therefore, efficient data collection tools and data compression methods must be used.

Interoperability: Due to the inconsistency of technical implementations and the heterogeneity of vendor adopted technologies, it is difficult to establish a unified digital twin network system with a common technology in a network domain. Therefore, it is first necessary to propose a unified architecture of digital twin network, in which all components and functionalities are clear to all stakeholders, and then to define standardized and unified interfaces to connect all network twins while ensuring the necessary compatibility.

Data modeling difficulties: Based on large-scale network data, data modeling should not only focus on ensuring the accuracy of model functions, but also has to consider the flexibility and scalability to compose and extend as required to support large scale and multi-purpose applications. Balancing these requirements further increases the complexity of building efficient and hierarchical functional data models. As an optional solution, straightforwardly cloning the real network using virtualized resources is feasible to build the twin network when the network scale is relatively small. However, the resource cost will be unaffordable for larger-scale networks. In this case, network modeling using mathematical abstraction or leveraging AI algorithms will be a more suitable solution.

Real-time requirements: Network services normally have real-time requirements, and the processing of model simulation and verification through a digital twin network will introduce service latency. Meanwhile, the real-time requirements will further impose performance requirements on the system software and hardware. However, given the nature of distributed systems and propagation delays, it is challenging to keep network digital twins in sync, or to auto-sync between the physical network and the digital twin network. Having changes to the digital object automatically drive changes in the physical object can be even more challenging. To address these requirements, the function and process of the data model need to be based on automated processing mechanisms under various network application scenarios. On the one hand, it is necessary to design a simplified process to reduce the time cost for tasks in the network twin as much as possible; on the other hand, it is recommended to define the real-time requirements of different applications, and then match the corresponding computing resources and suitable solutions as needed to complete the task processing in the twin.

Security risks: A digital twin network has to synchronize all or a subset of the data related to the involved physical networks in real time, which inevitably augments the attack surface, with a higher risk of information leakage in particular. On one hand, it is mandatory to design more secure data mechanisms leveraging legacy data protection methods, as well as innovative technologies such as blockchain. On the other hand, the system design can limit the data (especially raw data) required to build a digital twin network, leveraging innovative modeling technologies such as federated learning.

In brief, to address the above listed challenges, it is important to firstly propose a unified architecture of digital twin network, which defines the main functional components and interfaces (Section 6). Then, relying upon such an architecture, it is required to continue researching on the key enabling technologies including data acquisition, data storage, data modeling, interface standardization, and security assurance.

6. A Reference Architecture of Digital Twin Network

Based on the definition of the key digital twin network technology elements introduced in Section 3.3, a digital twin network architecture is depicted in Figure 2. This digital twin network architecture is broken down into three layers: Application Layer, Digital Twin Layer, and Physical Network Layer.

+---------------------------------------------------------+
|   +-------+   +-------+          +-------+              |
|   | App 1 |   | App 2 |   ...    | App n |   Application|
|   +-------+   +-------+          +-------+              |
+-------------^-------------------+-----------------------+
              |Capability Exposure| Intent Input
              |                   |
+-------------+-------------------v-----------------------+
|                        Instance of Digital Twin Network |
|  +--------+   +------------------------+   +--------+   |
|  |        |   | Service Mapping Models |   |        |   |
|  |        |   |  +------------------+  |   |        |   |
|  | Data   +--->  |Functional Models |  +---> Digital|   |
|  | Repo-  |   |  +-----+-----^------+  |   | Twin   |   |
|  | sitory |   |        |     |         |   | Network|   |
|  |        |   |  +-----v-----+------+  |   |  Mgmt  |   |
|  |        <---+  |  Basic Models    |  <---+        |   |
|  |        |   |  +------------------+  |   |        |   |
|  +--------+   +------------------------+   +--------+   |
+--------^----------------------------+-------------------+
         |                            |
         | data collection            | control
+--------+----------------------------v-------------------+
|                   Physical Network                      |
|                                                         |
+---------------------------------------------------------+

Figure 2: Reference Architecture of Digital Twin Network

Physical Network: All or subset of network elements in the physical network exchange network data and control messages with a network digital twin instance, through twin-physical control interfaces. The physical network can be a mobile access network, a transport network, a mobile core, a backbone, etc. The physical network can also be a data center network, a campus enterprise network, an industrial Internet of Things, etc.

The physical network can span across a single network administrative domain or multiple network administrative domains.

This document focuses on the IETF related physical network such as IP bearer network and datacenter network.

Digital Twin Layer: This layer includes three key subsystems: Data Repository subsystem, Service Mapping Models subsystem, and Digital Twin Network Management subsystem. These key subsystems can be placed in one single network administrative domain and provide the service to the application (e.g., SDN controller) in another network administrative domain, or be placed in every network administrative domain and coordinate with each other to provide services to the application in the upper layer.

One or multiple digital twin network instances can be built and maintained:

* Data Repository subsystem is responsible for collecting and storing various network data for building various models by collecting and updating the real-time operational data of various network elements through the twin southbound interface, and providing data services (e.g., fast retrieval, concurrent conflict handling, batch service) and unified interfaces to Service Mapping Models subsystem.

* Service Mapping Models subsystem completes data modeling, provides data model instances for various network applications, and maximizes the agility and programmability of network services. The data models include two major types: basic and functional models.

- Basic models refer to the network element model(s) and network topology model(s) of the network digital twin based on the basic configuration, environment information, operational state, link topology and other information of the network element(s), to complete the real-time accurate characterization of the physical network.

- Functional models refer to various data models used for network analysis, emulation, diagnosis, prediction, assurance, etc. The functional models can be constructed and expanded by multiple dimensions: by network type, there can be models serving for a single or multiple network domains; by function type, it can be divided into state monitoring, traffic analysis, security exercise, fault diagnosis, quality assurance and other models; by network lifecycle management, it can be divided into planning, construction, maintenance, optimization and operation. Functional models can also be divided into general models and special-purpose models. Specifically, multiple dimensions can be combined to create a data model for more specific application scenarios.

New applications might need new functional models that do not exist yet. If a new model is needed, ’Service Mapping Models’ subsystem will be triggered to help creating new models based on data retrieved from ’Data Repository’.

* Digital Twin Network Management fulfils the management function of digital twin network, records the life-cycle transactions of the twin entity, monitors the performance and resource consumption of the twin entity or even of individual models, visualizes and controls various elements of the network digital twin, including topology management, model management and security management.

Notes: ’Data collection’ and ’change control’ are regarded as southbound interfaces between virtual and physical network. From implementation perspective, they can optionally form a sub-layer or sub-system to provide common functionalities of data collection and change control, enabled by a specific infrastructure supporting bi-directional flows and facilitating data aggregation, action translation, pre-processing and ontologies.

Application Layer: Various applications (e.g., Operations, Administration, and Maintenance (OAM)) can effectively run over a digital twin network platform to implement either conventional or innovative network operations, with low cost and less service impact on real networks. Network applications make requests that need to be addressed by the digital twin network. Such requests are exchanged through a northbound interface, so they are applied by service emulation at the appropriate twin instance(s).
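To make the division of roles concrete, the following Python sketch wires the three Digital Twin Layer subsystems together in the simplest possible way. It is a hypothetical illustration rather than an implementation of this architecture: the class names, the "util" metric, and the "find_overloaded" intent are assumptions made for the example, and the Management subsystem is reduced to a transaction log.

```python
class DataRepository:
    """Stores the latest data collected from each network element."""

    def __init__(self):
        self._records = {}

    def collect(self, element, data):
        # Southbound direction: data collection from the physical network.
        self._records[element] = dict(data)

    def snapshot(self):
        return {name: dict(data) for name, data in self._records.items()}


class ServiceMappingModels:
    """Builds basic and functional models from repository data."""

    def __init__(self, repo):
        self._repo = repo

    def basic_model(self):
        # Basic model: per-element state as last collected.
        return self._repo.snapshot()

    def overloaded(self, threshold):
        # Functional model: a trivial state-monitoring analysis.
        model = self.basic_model()
        return sorted(e for e, d in model.items() if d.get("util", 0.0) > threshold)


class DigitalTwinInstance:
    """Ties the subsystems together and records life-cycle transactions."""

    def __init__(self):
        self.repo = DataRepository()
        self.models = ServiceMappingModels(self.repo)
        self.transactions = []  # stand-in for the Management subsystem

    def handle_intent(self, intent):
        # Northbound direction: requests from the Application Layer.
        self.transactions.append(intent)
        if intent["type"] == "find_overloaded":
            return self.models.overloaded(intent["threshold"])
        raise ValueError(f"unsupported intent: {intent['type']}")


twin = DigitalTwinInstance()
twin.repo.collect("r1", {"util": 0.9})
twin.repo.collect("r2", {"util": 0.2})
answer = twin.handle_intent({"type": "find_overloaded", "threshold": 0.8})
```

The application only sees handle_intent; it never touches the repository or the physical network directly, which mirrors the layering in Figure 2.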

7. Enabling Technologies to Build Digital Twin Network

This section briefly describes several key enabling technologies to build a digital twin network system, based on the challenges and the reference architecture described in the above sections. Each enabling technology is worthy of in-depth research in its own right.

7.1. Data Collection and Data Services

Data collection technology is the foundation of building data repository for digital twin network. Target driven mode should be adopted for data collection from heterogeneous data sources. The type, frequency and method of data collection shall meet the application of digital twin network. Whenever building network models for a specific network application, the required data can be efficiently obtained from the data repository.

Diverse existing tools and methods (e.g., SNMP, NETCONF, IPFIX, Telemetry, INT, etc.) can be used to collect different types of network data. Some innovative new methods (e.g., sketch-based measurement) can also be used to acquire complex network data such as network performance. Also, data transformation and aggregation capacity can be used to improve the applicability to network modelling. Toward building a data repository for a digital twin system, data collection tools and methods should be as lightweight as possible, so as to reduce the occupation of network equipment resources, and the collected data should be meaningful so that it is useful. Several solutions on data collection are in progress in the IETF/IRTF, e.g., adaptive subscription defined in [I-D.ietf-netconf-adaptive-subscription], efficient data collection defined in [I-D.zcz-nmrg-digitaltwin-data-collection], and contextual information defined in [I-D.claise-opsawg-collected-data-manifest].

The data repository stores large-scale, heterogeneous network data and provides data and services for building various network models. It is therefore also necessary to study data service technologies, including fast search, batch-data handling, conflict avoidance, and data access interfaces.

7.2. Network Modeling

The basic network element models and topology models help generate a virtual twin of the network according to the network element configuration, operation data, network topology relationships, link state, and other network information. The operation status can then be monitored and displayed, and network configuration changes and optimization strategies can be pre-verified.

For small-scale networks, network simulation tools (e.g., [NS-3], [Mininet]) and emulation tools (e.g., [EVE-NG], [GNS3]) can be used to build basic network models. By using the packet processing capability of virtual network elements, such tools can quickly verify the functions of the control plane and data plane. However, this modeling method also has many limitations, including high resource consumption, poor performance analysis ability, and poor scalability. For large-scale networks, mathematical abstraction methods can be used to build basic network models efficiently. Knowledge graphs, network calculus, and formal verification are candidate methods. Relevant research has emerged in recent years, such as [Hong2021], [G2-SIGCOMM], and [DNA-2022]. Going forward, improving the extensibility and accuracy of the models remains a major challenge.

As an example, the theory of bottleneck structures introduced in [G2-SIGCOMM] and [G2-SIGMETRICS] can be used to construct a mathematical model of the network (see also [I-D.giraltyellamraju-alto-bsg-requirements] for more information). A bottleneck structure is a computational graph that efficiently captures the topology, the routing, and the flow properties of the network. The graph embeds the latent relationships that exist between bottlenecks and the application flows in a distributed system, providing an efficient mathematical framework to compute the ripple effects of perturbations (e.g., a flow arriving at or departing from the system, or a dynamic change in the capacity of a wireless link). Because these perturbations can be seen as mathematical derivatives of the communication system, bottleneck structures can be used to compute optimized network configurations, providing a natural engineering sandbox for building network models. One of the key advantages of bottleneck structures is that they can be used to compute (symbolically or numerically) key performance indicators of the network (e.g., expected flow throughput, projected flow completion time) without the need for computationally intensive simulators. This capability can be especially useful when building a digital twin of a large-scale network, potentially saving orders of magnitude in computational resources in comparison to simulation-based or emulation-based approaches.
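As a rough illustration of how flow throughput can be computed analytically rather than by packet-level simulation, the sketch below derives max-min fair flow rates with the classic water-filling procedure and records the link at which each flow is bottlenecked. It is not the algorithm of [G2-SIGCOMM]; the function name, the data layout, and the assumption that every flow traverses at least one link are choices made for this example.

```python
def max_min_fair(capacity, routes):
    """Water-filling allocation.

    capacity: link -> capacity; routes: flow -> list of links (non-empty).
    Returns (rate per flow, bottleneck link per flow).
    """
    remaining = dict(capacity)
    active = {flow: set(links) for flow, links in routes.items()}
    rate, bottleneck = {}, {}
    while active:
        # Fair share each link can still offer to its unresolved flows.
        share = {}
        for link, cap in remaining.items():
            users = sum(1 for links in active.values() if link in links)
            if users:
                share[link] = cap / users
        # The link with the smallest fair share saturates first.
        tight = min(share, key=share.get)
        level = share[tight]
        for flow in [f for f, links in active.items() if tight in links]:
            rate[flow] = level
            bottleneck[flow] = tight
            for link in active.pop(flow):
                remaining[link] -= level
    return rate, bottleneck


rates, bottlenecks = max_min_fair(
    {"l1": 10, "l2": 20},
    {"f1": ["l1"], "f2": ["l1", "l2"], "f3": ["l2"]},
)
# rates == {"f1": 5.0, "f2": 5.0, "f3": 15.0}
```

In this example f3 is limited by l2 only after f2 has been frozen at l1, which is the kind of dependency between bottlenecks that a bottleneck structure makes explicit.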

The functional model aims to realize the dynamic evolution of network performance evaluation and intelligent decision-making. Data-driven AI/ML algorithms will play a major role in building complex network functional models. This has been a research hotspot in recent years, and many successful cases have been demonstrated, such as [RouteNet] and [MimicNet]. In the future, in addition to improving the generalization ability and interpretability of AI models, attention is also needed on improving the real-time performance and interactivity of model reasoning based on data and control in the network digital twin layer.

7.3. Network Visualization

An inherent requirement of the digital twin network system is to use network visualization technology to present the data and models in the network twin with high fidelity, and to intuitively reflect the interactive mapping between the physical network entity and the network twin. Network visualization technology can help users understand the internal structure of the network, and also help mine valuable information hidden in the network.

Network visualization can use algorithms such as hierarchical layout, heuristic layout, or force-directed layout (or a combination of several algorithms) for topology layout. The related topology data can be acquired using solutions provided in [RFC8345], [RFC8346], [RFC8944], etc. Meanwhile, a digital twin network system can select different interaction methods, or combinations of interaction methods, to realize the visual dynamic interaction mapping of virtual and real networks. A data query technology such as SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.
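As an illustration of the simplest of these layout families, the sketch below computes a hierarchical layout by placing each node on a layer equal to its hop distance from a chosen root and spreading the nodes of a layer evenly along the horizontal axis. The function name, the adjacency-list input, and the coordinate convention are assumptions made for this example; nodes that are unreachable from the root are simply omitted.

```python
from collections import deque


def hierarchical_layout(adjacency, root):
    """Return node -> (x, y), with y the hop distance from root and x in (0, 1)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:                                  # breadth-first search
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    layers = {}
    for node, d in depth.items():
        layers.setdefault(d, []).append(node)
    positions = {}
    for d, nodes in layers.items():
        for i, node in enumerate(sorted(nodes)):
            # Spread the nodes of one layer evenly over the interval (0, 1).
            positions[node] = ((i + 1) / (len(nodes) + 1), float(d))
    return positions


topology = {
    "core": ["agg1", "agg2"],
    "agg1": ["core", "tor1"],
    "agg2": ["core", "tor1"],
    "tor1": ["agg1", "agg2"],
}
layout = hierarchical_layout(topology, "core")
```

The adjacency list could be derived from a topology retrieved through the YANG models cited above; a production tool would additionally reorder nodes within a layer to reduce edge crossings.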

7.4. Interfaces

Based on the reference architecture, there are three types of interfaces for building a digital twin network system.

1) Southbound interfaces are twin interfaces between the physical network and its twin entity. They are responsible for information exchange between the physical network and the network digital twin. Candidate interfaces include SNMP and NETCONF.

2) Northbound interfaces are Application-facing interfaces between the network digital twin and applications. They are responsible for information exchange between network digital twin and network applications. The lightweight and extensible [RESTFul] interface can be the candidate northbound interface.

3) Internal interfaces are within the network digital twin layer. They are responsible for information exchange between the three subsystems: Data Repository, Service Mapping Models, and Digital Twin Network Management. These interfaces should offer high speed, high efficiency, and high concurrency. Candidate interfaces or protocols include XMPP (defined in [RFC7622]) and HTTP/3 (defined in [RFC9114]).

All interfaces are recommended to be open and standardized, so as to help avoid hardware or software vendor lock-in and to achieve interoperability. Besides the interfaces listed above, new interfaces or protocols can be created to better serve the digital twin network system.

8. Interaction with IBN

Intent-Based Networking (IBN) is an innovative technology for life-cycle network management. Future networks may well be intent-based, which means that users can input their abstract ’intent’ to the network, instead of detailed policies or configurations on the network devices. [I-D.irtf-nmrg-ibn-concepts-definitions] clarifies the concept of "Intent" and provides an overview of IBN functionalities. The key characteristic of an IBN system is that user intent can be assured automatically by continuously adjusting the policies and validating the real-time situation.

IBN can be envisaged in a digital twin network context to show how a digital twin network improves the efficiency of deploying network innovation. To lower the impact on real networks, several rounds of adjustment and validation can be emulated on the digital twin network platform instead of directly on the physical network. Therefore, a digital twin network can be an important enabler platform for implementing IBN systems and speeding up their deployment.
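The adjust-and-validate rounds described above can be sketched as a small loop in which every candidate configuration is evaluated on the twin and only a configuration that satisfies the intent is returned for deployment. Everything in this sketch is hypothetical: the function names, the two-path scenario, the 80 % utilization intent, and the fixed 10 % adjustment step are assumptions chosen to keep the example short, and the twin is reduced to a simple arithmetic stand-in.

```python
def assure_intent(twin_evaluate, config, adjust, satisfied, max_rounds=10):
    """Run adjust/validate rounds on the twin; return a validated config or None."""
    for _ in range(max_rounds):
        state = twin_evaluate(config)     # emulated on the twin, not the real network
        if satisfied(state):
            return config                 # candidate for deployment to the real network
        config = adjust(config, state)
    return None


# Hypothetical scenario: 12 Gbps of demand split over two 10 Gbps paths A and B.
DEMAND, CAPACITY = 12, 10


def twin_evaluate(percent_on_a):
    """Stand-in for the twin: utilization of paths A and B for a given split."""
    on_a = DEMAND * percent_on_a / 100
    return (on_a / CAPACITY, (DEMAND - on_a) / CAPACITY)


def satisfied(utilization):
    return max(utilization) <= 0.8        # the intent: no path above 80 % load


def adjust(percent_on_a, utilization):
    return percent_on_a - 10              # shift 10 % of the traffic to path B


validated = assure_intent(twin_evaluate, 100, adjust, satisfied)
```

Starting from all traffic on path A, the loop settles on a 60/40 split after five rounds, none of which touches the physical network.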

9. Sample Application Scenarios

Digital twin network can be applied to solve different problems in network management and operation.

9.1. Human Training

The usual approach to network OAM, with procedures applied by humans, is open to errors in all these procedures, with impact on network availability and resilience. Response procedures and actions for the most relevant operational requests and incidents are commonly defined to reduce errors to a minimum. The progressive automation of these procedures, such as predictive control or closed-loop management, reduces faults and response time, but a human-in-the-loop is still needed for multiple actions. These processes are not intuitive and require training to learn how to respond.

The use of a digital twin network for this purpose in different network management activities will improve the operators’ performance. One common example is cybersecurity incident handling, where "cyber-range" exercises are executed periodically to train security practitioners. A digital twin network will offer realistic environments, fitted to the real production networks.

9.2. Machine Learning Training

Machine Learning requires data and their context to be available in order to be applied. A common approach in the network management environment has been to simulate or import data into a specific environment (the ML developer lab), where they are used to train the selected model; later, when the model is deployed in production, it is re-trained or adjusted to the production environment context. This demands a specific adaptation period.

A digital twin network simplifies the complete ML development lifecycle by providing a realistic environment, including network topologies, to generate the required data in a well-aligned context. The generated dataset belongs to the digital twin network and not to the production network, allowing information access by third parties without impacting data privacy.

9.3. DevOps-Oriented Certification

The potential application of CI/CD models to network management operations increases the risk associated with the deployment of non-validated updates, which conflicts with the goal of the certification requirements applied by network service providers. A solution for addressing these certification requirements is to verify the specific impacts of updates on service assurance and SLAs using a digital twin network environment that replicates the network particularities, as a step prior to production release.

The digital twin network control functional block supports such dynamic mechanisms required by DevOps procedures.

9.4. Network Fuzzing

The dependency of network management on programmability increases system complexity. The behavior of new protocol stacks, API parameters, and interactions among complex software components are examples that imply a higher risk of errors or vulnerabilities in software and configuration.

A digital twin network allows fuzz testing techniques to be applied on a twin network environment, with interactions and conditions similar to the production network, making it possible to identify and fix vulnerabilities, bugs, and zero-day attacks before production delivery.

10. Research Perspectives: A Summary

Research on digital twin network has just started. This document presents an overview of the digital twin network concepts and reference architecture. Looking forward, further elaboration on digital twin network scenarios, requirements, architecture, and key enabling technologies should be investigated by the industry, so as to accelerate the implementation and deployment of digital twin network.

11. Security Considerations

This document describes concepts and definitions of digital twin network. As such, the following security considerations remain high level, i.e., in the form of principles, guidelines or requirements.

Security considerations of the digital twin network include:

* Secure the digital twin system itself.

* Data privacy protection.

Securing the digital twin network system aims at making the digital twin system operationally secure by implementing security mechanisms and applying security best practices. In the context of digital twin network, such mechanisms and practices may consist of data verification and model validation, and of restricting mapping operations between the physical network and its digital counterpart to authenticated and authorized users only.

Synchronizing the data between the physical and the digital twin networks may increase the risk of sensitive data and information leakage. Strict control and security mechanisms must be provided and enabled to prevent data leaks.

12. Acknowledgements

Many thanks to the NMRG participants for their comments and reviews. Thanks to Daniel King, Quifang Ma, Laurent Ciavaglia, Jerome Francois, Jordi Paillisse, Luis Miguel Contreras Murillo, Alexander Clemm, Qiao Xiang, Ramin Sadre, Pedro Martinez-Julia, Wei Wang, Zongpeng Du, and Peng Liu.

Diego Lopez and Antonio Pastor were partly supported by the European Commission under Horizon 2020 grant agreement no. 833685 (SPIDER), and grant agreement no. 871808 (INSPIRE-5Gplus).

13. IANA Considerations

This document has no requests to IANA.

14. Open issues

* Some technologies (e.g. Network connectivity, Real-time data communication, Collaboration management, conflict detection and resolution, etc.) recently discussed in the IRTF/IETF should be described.

* In the ’Sample Application Scenarios’ section, one or two use cases should be explored in more depth.

* On the research side, the idea behind digital twin networks is reminiscent of earlier work from the 1990s that should be referenced/acknowledged. Examples include the Shadow MIB concept, Inductive Modeling Technique, etc.

15. References

[RFC7622] Saint-Andre, P., "Extensible Messaging and Presence Protocol (XMPP): Address Format", RFC 7622, DOI 10.17487/RFC7622, September 2015, <https://www.rfc-editor.org/info/rfc7622>.

[RFC8345] Clemm, A., Medved, J., Varga, R., Bahadur, N., Ananthakrishnan, H., and X. Liu, "A YANG Data Model for Network Topologies", RFC 8345, DOI 10.17487/RFC8345, March 2018, <https://www.rfc-editor.org/info/rfc8345>.

[RFC8346] Clemm, A., Medved, J., Varga, R., Liu, X., Ananthakrishnan, H., and N. Bahadur, "A YANG Data Model for Layer 3 Topologies", RFC 8346, DOI 10.17487/RFC8346, March 2018, <https://www.rfc-editor.org/info/rfc8346>.

[RFC8944] Dong, J., Wei, X., Wu, Q., Boucadair, M., and A. Liu, "A YANG Data Model for Layer 2 Network Topologies", RFC 8944, DOI 10.17487/RFC8944, November 2020, <https://www.rfc-editor.org/info/rfc8944>.

[RFC9114] Bishop, M., Ed., "HTTP/3", RFC 9114, DOI 10.17487/RFC9114, June 2022, <https://www.rfc-editor.org/info/rfc9114>.

[Dai2020] Dai, Y., Zhang, K., Maharjan, S., and Y. Zhang, "Deep Reinforcement Learning for Stochastic Computation Offloading in Digital Twin Networks. IEEE Transactions on Industrial Informatics, vol. 17, no. 17", August 2020.

[DNA-2022] Zhang, P., Gember-Jacobson, A., Zuo, Y., Huang, Y., Liu, X., and H. Li, "Differential Network Analysis, USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)", April 2022.

[Dong2019] Dong, R., She, C., Hardjawana, W., Li, Y., and B. Vucetic, "Deep Learning for Hybrid 5G Services in Mobile Edge Computing Systems: Learn from a Digital Twin. IEEE Transactions on Wireless Communications, vol. 18, no. 10", July 2019.

[DTPI2021] "IEEE International Conference on Digital Twins and Parallel Intelligence - Digital Twin Network Session, https://www.dtpi.org/video/10", July 2021.

[EVE-NG] "Emulated Virtual Environment Next Generation, EVE-NG. https://www.eve-ng.net/".

[Fuller2020] Fuller, A., Fan, Z., Day, C., and C. Barlow, "Digital Twin: Enabling Technologies, Challenges and Open Research. IEEE Access, vol. 8, pp. 108952-108971", 2020.

[G2-SIGCOMM] Ros-Giralt, J., Amsel, N., Yellamraju, S., Ezick, J., Lethin, R., Jiang, Y., Feng, A., Tassiulas, L., Wu, Z., and K. Bergman, "Designing data center networks using bottleneck structures, ACM SIGCOMM", August 2021.

[G2-SIGMETRICS] Ros-Giralt, J., Bohara, A., Yellamraju, S., Langston, H., Lethin, R., Jiang, Y., Tassiulas, L., Li, J., Tan, Y., and M. Veeraraghavan, "On the Bottleneck Structure of Congestion-Controlled Networks, ACM SIGMETRICS", December 2019.

[GNS3] "Graphical Network Simulator-3, GNS3. https://www.gns3.com/".

[Grieves2014] Grieves, M., "Digital twin: Manufacturing excellence through virtual factory replication", 2003, <https://www.3ds.com/fileadmin/PRODUCTS-SERVICES/DELMIA/PDF/Whitepaper/DELMIA-APRISO-Digital-Twin-Whitepaper.pdf>.

[Hong2021] Hong, H., Wu, Q., Dong, F., Song, W., Sun, R., Han, T., Zhou, C., and H. Yang, "NetGraph: An Intelligent Operated Digital Twin Platform for Data Center Networks. In ACM SIGCOMM 2021 Workshop on Network-Application Integration (NAI’ 21), Virtual Event, USA. ACM, New York, NY, USA", 2021.

[Hu2021] Hu, W., Zhang, T., Deng, X., Liu, Z., and J. Tan, "Digital twin: a state-of-the-art review of its enabling technologies, applications and challenges. Journal of Intelligent Manufacturing and Special Equipment, Vol. 2 No. 1, pp. 1-34", 2021.

[I-D.claise-opsawg-collected-data-manifest] Claise, B., Quilbeuf, J., Lopez, D. R., Dominguez, I., and T. Graf, "A Data Manifest for Contextualized Telemetry Data", Work in Progress, Internet-Draft, draft-claise-opsawg-collected-data-manifest-02, 20 March 2022, <https://www.ietf.org/archive/id/draft-claise-opsawg-collected-data-manifest-02.txt>.

[I-D.giraltyellamraju-alto-bsg-requirements] Ros-Giralt, J., Yellamraju, S., Wu, Q., Contreras, L. M., Yang, R., and K. Gao, "Supporting Bottleneck Structure Graphs in ALTO: Use Cases and Requirements", Work in Progress, Internet-Draft, draft-giraltyellamraju-alto-bsg-requirements-01, 23 March 2022, <https://www.ietf.org/archive/id/draft-giraltyellamraju-alto-bsg-requirements-01.txt>.

[I-D.ietf-netconf-adaptive-subscription] Wu, Q., Song, W., Liu, P., Ma, Q., Wang, W., and Z. Niu, "Adaptive Subscription to YANG Notification", Work in Progress, Internet-Draft, draft-ietf-netconf-adaptive-subscription-00, 23 June 2022, <https://www.ietf.org/archive/id/draft-ietf-netconf-adaptive-subscription-00.txt>.

[I-D.irtf-nmrg-ibn-concepts-definitions] Clemm, A., Ciavaglia, L., Granville, L. Z., and J. Tantsura, "Intent-Based Networking - Concepts and Definitions", Work in Progress, Internet-Draft, draft-irtf-nmrg-ibn-concepts-definitions-09, 24 March 2022, <https://www.ietf.org/archive/id/draft-irtf-nmrg-ibn-concepts-definitions-09.txt>.

[I-D.zcz-nmrg-digitaltwin-data-collection] Zhou, C., Chen, D., and P. Martinez-Julia, "Data Collection Requirements and Technologies for Digital Twin Network", Work in Progress, Internet-Draft, draft-zcz-nmrg-digitaltwin-data-collection-00, 10 July 2022, <https://www.ietf.org/archive/id/draft-zcz-nmrg-digitaltwin-data-collection-00.txt>.

[ISO-2021] ISO, "Digital Twin manufacturing framework - Part 2: Reference architecture: ISO/CD 23247-2. https://www.iso.org/standard/78743.html", 2021.

[Madni2019] Madni, A., Madni, C., and S. Lucero, "Leveraging digital twin technology in model-based systems engineering. Systems, vol. 7, no. 1, p. 7", January 2019.

[MimicNet] Zhang, Q., Ng, K. K. W., Kazer, C. W., Yan, S., Sedoc, J., and V. Liu, "MimicNet: Fast Performance Estimates for Data Center Networks with Machine Learning. In ACM SIGCOMM 2021 Conference (SIGCOMM 21).", August 2021.

[Mininet] "Mininet: An Instant Virtual Network on your Laptop, http://mininet.org/".

[Natis-Gartner2017] Natis, Y., Velosa, A., and W. R. Schulte, "Innovation insight for digital twins - driving better IoT-fueled decisions. https://www.gartner.com/en/documents/3645341", 2017.

[Nguyen2021] Nguyen, H. X., Trestian, R., To, D., and M. Tatipamula, "Digital Twin for 5G and Beyond. IEEE Communications Magazine, vol. 59, no. 2", February 2021.

[NS-3] "Network Simulator, NS-3. https://www.nsnam.org/".

[RESTFul] Richardson, L. and M. Amundsen, "RESTful Web APIs. O’Reilly Media, Inc.", 2013.

[Roson2015] Rosen, R., von Wichert, G., Lo, G., and K. D. Bettenhausen, "About the importance of autonomy and DTs for the future of manufacturing. IFAC-Papersonline, Vol. 48, pp. 567-572.", 2015.

[RouteNet] Rusek, K., Suárez-Varela, J., Almasan, P., Barlet-Ros, P., and A. Cabellos-Aparicio, "RouteNet: Leveraging Graph Neural Networks for network modeling and optimization in SDN. IEEE Journal on Selected Areas in Communication (JSAC), vol. 38, no. 10", October 2020.

[Tao2019] Tao, F., Zhang, H., Liu, A., and A. Y. C. Nee, "Digital Twin in Industry: State-of-the-Art. IEEE Transactions on Industrial Informatics, vol. 15, no. 4.", April 2019.

[TNT2022] "IEEE International workshop on Technologies for Network Twins, https://sites.google.com/view/tnt-2022/", 2022.

Appendix A. Change Logs

v06 - v07: Addressed reviewer’s comments from adoption call, including below major changes.

* Resequenced the sections via adding more subsections on concepts of digital twin network, removing the ’Requirements Language’ section, and moving ahead the ’Challenges’ section.

* Cited more papers, or industrial information on digital twin concepts and digital twin for networks.

* Added more information describing the challenges and key characteristics of digital twin network.

* Removed previous open issue on investigating related digital twin network work and identify the differences and commonalities, and added several new open issues for future study.

* Other editorial changes.

v05 - v06: Addressed comments from the meeting and mailing list, to request adoption call.

* Remove acronym DTN to avoid conflict with ’Delay Tolerant Network’;

* Elaborate the description of Digital Twin Network architecture that supports multiple instances;

* Other Editorial changes.

v04 - v05

* Clarify the difference between digital twin network platform and traditional network management system;

* Add more references of researches on applying digital twin to network field;

* Clarify the benefit of ’Privacy and Regulatory Compliance’;

* Refine the description of reference architecture;

* Other Editorial changes.

v03 - v04

* Update data definition and models definitions to clarify their difference.

* Remove the orchestration element and consolidated into control functionality building block in the digital twin network.

* Clarify the mapping relation (one to one, and one to many) in the mapping definition.

* Add explanation text for continuous verification.

v02 - v03

* Split interaction with IBN part as a separate section.

* Fill security section;

* Clarify the motivation in the introduction section;

* Use new boilerplate for requirements language section;

* Key elements definition update.

* Other editorial changes.

* Add open issues section.

* Add section on application scenarios.

Cheng Zhou China Mobile Beijing 100053 China Email: zhouchengyjy@chinamobile.com

Hongwei Yang China Mobile Beijing 100053 China Email: yanghongwei@chinamobile.com

Xiaodong Duan China Mobile Beijing 100053 China Email: duanxiaodong@chinamobile.com

Diego Lopez Telefonica I+D Seville Spain Email: diego.r.lopez@telefonica.com

Antonio Pastor Telefonica I+D Madrid Spain Email: antonio.pastorperales@telefonica.com

Qin Wu Huawei 101 Software Avenue, Yuhua District Nanjing Jiangsu, 210012 China Email: bill.wu@huawei.com

Mohamed Boucadair Orange Rennes 35000 France Email: mohamed.boucadair@orange.com

Christian Jacquenet Orange Rennes 35000 France Email: christian.jacquenet@orange.com

Network Management Research Group J. Paillisse

Internet-Draft P. Almasan

Intended status: Informational M. Ferriol

Expires: 12 January 2023 P. Barlet

A. Cabellos

UPC-BarcelonaTech

S. Xiao

X. Shi

X. Cheng

Huawei

D. Perino

D. Lopez

A. Pastor

Telefonica I+D

11 July 2022

A Performance-Oriented Digital Twin for Carrier Networks draft-paillisse-nmrg-performance-digital-twin-00

Abstract

This draft introduces the concept of a Network Digital Twin (NDT) for performance evaluation. A Performance NDT is able to produce performance estimates (delay, jitter, loss) of a given input network with a specific topology, traffic demand, and routing and scheduling configuration. Also, this draft discusses the interface of the digital twin, how it relates to existing control plane elements, use cases, and possible implementation options.

Status of This Memo

Paillisse, et al. Expires 12 January 2023 [Page 1]

Internet-Draft Network Performance Digital Twin July 2022

Copyright Notice

Table of Contents

1. Introduction
2. Terminology
3. Architecture of the Network Performance Digital Twin
4. Interfaces
4.1. Administrator
4.2. Configuration Interface
4.3. Digital Twin Interface (DTI)
5. Mapping to the Network Digital Twin Architecture
6. Use Cases
6.1. Network Operations and Management
6.1.1. Network planning
6.1.2. What-if scenarios
6.1.3. Troubleshooting
6.1.4. Anomaly detection
6.1.5. Training
6.2. Network Optimization
7. Implementation Challenges
7.1. Simulation
7.2. Emulation
7.3. Analytical Modelling
7.4. Neural Networks
7.4.1. MultiLayer Perceptron
7.4.2. Recurrent Neural Networks
7.4.3. Convolutional Neural Networks
7.4.4. Graph Neural Networks
7.4.5. NN Comparison
8. Training
9. IANA Considerations
10. Security Considerations
11. References
11.1. Normative References
11.2. Informative References
Acknowledgements
Authors’ Addresses

1. Introduction

A Digital Twin for computer networks is a virtual replica of an existing network with a behavior equivalent to that of the real one. The key advantage of a Network Digital Twin (NDT) is the ability to recreate the complexities and particularities of the network infrastructure without the deployment cost of a real network. Hence, network administrators can test, deploy and modify network configurations safely, without worrying about the impact on the real network. Once the administrator has found a configuration that fulfills the expected objectives, it is deployed to the real network. In addition, working with an NDT is faster, safer and more cost-effective than interacting with the physical network. All these characteristics make an NDT useful for different network management tasks, ranging from network planning or troubleshooting to optimization.

The concept of an NDT has been proposed in different contexts: network management [I-D.draft-zhou-nmrg-digitaltwin-network-concepts], 5G networks [digital-twin-5G], vehicular networks [digital-twin-vanets], artificial intelligence [digital-twin-AI], or Industry 4.0 [digital-twin-industry], among others.

This draft proposes a Digital Twin for network management with a focus on performance evaluation. That is, given several input parameters (topology, traffic matrix, etc), a Network Performance Digital Twin (NPDT) predicts network performance metrics such as delay (per path or per link), jitter, or loss. This draft defines the inputs and outputs of such Digital Twin, the associated interfaces with other modules in the network control plane, and details use cases.

In addition, this draft discusses possible implementation options for the NPDT, with a special emphasis on those based on Machine Learning. The aim of Section 7 (Implementation Challenges) is to describe the advantages and limitations of these techniques. For example, most Machine Learning technologies rely heavily on large amounts of data to achieve acceptable accuracy. Other considerations include adjusting the architecture of the Neural Network so that it can successfully capture the structure of the input data.

In order to use a Network Performance Digital Twin (NPDT) in practical scenarios (see Section 6), such as network optimization, it should meet certain requirements:

Fast: low delay when making predictions (in the order of milliseconds), so that it can be used in optimization scenarios that need to test a large number of configuration variables (see Section 6.2).

Accurate: the error of the prediction (vs the ground truth) has to be below a certain threshold to be deployable in real-world networks.

Scalable: support networks with arbitrarily large topologies.

Variety of Inputs: accept a wide range of combinations of:

* Routing configurations

* Scheduling configurations (FIFO, Weighted Fair Queueing, Deficit Round Robin, etc)

* Topologies

* Traffic Matrices

* Traffic Models (constant bitrate, Poisson, ON/OFF, etc)

Accessible: despite the internal architecture of the NPDT, it needs to be easy to use for network engineers and administrators. This includes, but is not limited to: interfaces to communicate with NPDT that are well-known in the networking community, metrics that are readily understood by network engineers, or confidence values of the estimations.

Note that the inputs and outputs described here are an example, but other inputs and outputs are possible depending on the specificities of each scenario.
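To make the shape of these inputs and outputs more tangible, the sketch below groups them into two small data structures and pairs them with a toy predictor. The predictor is only a stand-in: it sums per-link M/M/1 mean delays, 1 / (mu - lambda), which is a textbook approximation valid for FIFO queues with Poisson arrivals and exponential service times, not the prediction method proposed in this draft. All class, field, and function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    links: dict                      # link -> service rate in packets/s
    routing: dict                    # (src, dst) -> list of links on the path
    traffic: dict                    # (src, dst) -> offered load in packets/s
    scheduling: str = "FIFO"
    traffic_model: str = "Poisson"


@dataclass
class Estimate:
    delay_s: dict                    # (src, dst) -> mean path delay in seconds


def predict(scenario):
    """Toy stand-in for an NPDT: sum of per-link M/M/1 delays along each path."""
    load = {link: 0.0 for link in scenario.links}
    for pair, path in scenario.routing.items():
        for link in path:
            load[link] += scenario.traffic.get(pair, 0.0)
    delay = {}
    for pair, path in scenario.routing.items():
        total = 0.0
        for link in path:
            mu = scenario.links[link]
            if load[link] >= mu:
                total = float("inf")     # overloaded link: the queue is unstable
                break
            total += 1.0 / (mu - load[link])
        delay[pair] = total
    return Estimate(delay_s=delay)


scenario = Scenario(
    links={"a-b": 1000, "b-c": 1000},
    routing={("a", "c"): ["a-b", "b-c"], ("a", "b"): ["a-b"]},
    traffic={("a", "c"): 500, ("a", "b"): 250},
)
estimate = predict(scenario)
```

A real NPDT would replace predict() with one of the implementation options discussed in Section 7 and would typically also report jitter, loss, and a confidence value for each estimate.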

2. Terminology

Digital Twin (DT): A virtual replica of a physical system.

Network Digital Twin (NDT): A virtual replica of a physical network.

Network Performance Digital Twin (NPDT): A virtual replica of a physical network that can accurately predict several performance metrics of the physical network.

Network Optimizer: An algorithm capable of finding the optimal configuration parameters of a network, e.g. OSPF weights, given an optimization objective, e.g. latency below a certain threshold.

Control Plane: Any system, hardware or software, centralized or decentralized, in charge of controlling and managing a physical network. Examples are routing protocols, SDN controllers, etc.

3. Architecture of the Network Performance Digital Twin

Figure 1 presents an overview of the architecture of a Network Performance Digital Twin (NPDT).

Figure 1: Global architecture of the Performance DT

Each element is defined as:

Network Performance Digital Twin (NPDT): a system capable of generating performance estimates of a specific instance of a network.

Physical Network: a real-world network that can be configured via standard interfaces.

Management Plane: The set of hardware and software elements in charge of controlling the Physical Network. This includes routing processes, optimization algorithms, network controllers, visibility platforms, etc. The definition, organization and implementation of the elements within the management plane are outside the scope of this document. In what follows, some elements of the management plane that are relevant to this document are described.

* Optimizer: a network optimizer that can tune the configuration parameters of a network given one or more optimization objectives, e.g. do not exceed a latency threshold in all paths, minimize the load of the most used link, and avoid more than 10 Gbps of traffic at router R4 [DEFO].

* Intent-Based Renderer: a system capable of understanding network intent, according to the definitions in [irtf-nmrg-ibn-concepts-definitions-09].

* Measure: any system to measure the status and performance of a network, e.g. Netflow [RFC3954], streaming telemetry [streaming-telemetry], etc.

* Configure: any system to apply configuration settings to the network devices, e.g. a NETCONF Manager or an end-to-end system to manage device configuration files [facebook-config].

And the functions of each interface are:

DT Interface (DTI): an interface to communicate with the Network Performance Digital Twin (NPDT). Inputs to the DT are a description of the network (topology, routing configuration, etc.), and the outputs are performance metrics (delay, jitter, loss; cf. Section 4).

Configuration Interface (CI): a standard interface to configure the physical network, such as NETCONF [RFC6241], YANG, OpenFlow [OFspec], LISP [RFC6830], etc.

Measurement Interface (MI): a standard interface to collect network status information, such as Netflow [RFC3954], SNMP, streaming telemetry [openconfig-rtgwg-gnmi-spec-01], etc.

Intent-Based Interface (IBI): an interface for the network administrator to define optimization objectives or run the DT to obtain performance estimates, among others.

4. Interfaces

4.1. Administrator

This interface can be a simple CLI or a state-of-the-art GUI, depending on the final product. In summary, it has to offer the network administrator the following options/features:

* Predict the performance of one or more network scenarios, defined by the administrator. Several use-cases related to this option are detailed in Section 6.1.

* Define network optimization objectives and run the network optimizer.

* Apply the optimized configuration to the physical network.

4.2. Configuration Interface

This interface is used to configure the Physical Network with the configuration parameters obtained from the optimizer. It can be composed of one or more IETF protocols for network configuration. A non-exhaustive list is: NETCONF [RFC6241], RESTCONF/YANG [RFC8040], PCE [RFC4655], OVSDB [RFC7047], or LISP [RFC6830]. It is also possible to use other standards defined outside the IETF that allow the configuration of elements in the forwarding plane, e.g. OpenFlow [OFspec] or P4 Runtime [P4Rspec].

4.3. Digital Twin Interface (DTI)

This interface can be defined with any widespread data format, such as CSV files or JSON objects. The data falls into two groups, inputs and outputs. The descriptions below assume a network with N nodes.

Inputs: data sent to the NPDT to calculate the performance estimates:

* Topology: description of the network topology in graph format, e.g. NetworkX [NetworkXlib].

* Routing configuration: a matrix of size N*N. Each cell contains the path from source N(i) to destination N(j) as a series of nodes of the topology. Note that not all source-destination pairs may have a path. Since the NPDT only needs a sequence of nodes to define a route, it supports different routing protocols, from OSPF, IS-IS or BGP, to SRv6, LISP, etc.

* Traffic Demands: a definition of the traffic that is injected into the network. It can be specified with different granularities, ranging from a list of 5-tuple flows and their associated traffic intensity, to an N*N matrix defining the traffic intensity for each source-destination pair. Some source-destination pairs may have zero traffic intensity. The traffic intensity defines parameters of the traffic: bits per second, number of packets, average packet size, etc.

* Traffic Model: the statistical properties of the input traffic, e.g. Video on Demand, backup, VoIP traffic, etc. It can be defined globally for the whole network or individually for each flow in the Traffic Demands.

* Scheduling configuration: attributes associated with the nodes of the topology graph describing the scheduling configuration of the network, that is, (1) the scheduling policy (e.g. FIFO, WFQ, DRR, etc.), and (2) the number of queues per output port.

Outputs: performance estimates of the NPDT: three matrices of size N*N containing the delay, jitter and loss for all the paths in the input topology.

Note that this is an example of the inputs/outputs of a performance NPDT, but other inputs and outputs are possible depending on the specificities of each scenario.
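As an illustration only, the inputs listed above could be encoded as a JSON object along the following lines. This is a minimal Python sketch for a hypothetical three-node network; the field names (e.g. "traffic_demands_bps", "queues_per_port"), the node names, and the helper check_dti_input are assumptions made for this example and are not defined by this document.

```python
import json

# Hypothetical 3-node network; all names below are illustrative assumptions.
nodes = ["A", "B", "C"]

dti_input = {
    "topology": {
        "nodes": nodes,
        # Directed links with their capacity in bits per second.
        "links": [
            {"src": "A", "dst": "B", "capacity_bps": 1e9},
            {"src": "B", "dst": "C", "capacity_bps": 1e9},
        ],
    },
    # N*N routing matrix: cell [i][j] is the node sequence from N(i) to N(j),
    # or None when the pair has no path.
    "routing": [
        [None, ["A", "B"], ["A", "B", "C"]],
        [None, None, ["B", "C"]],
        [None, None, None],
    ],
    # N*N traffic matrix in bits per second; zero means no demand.
    "traffic_demands_bps": [
        [0, 2e8, 1e8],
        [0, 0, 3e8],
        [0, 0, 0],
    ],
    "traffic_model": "VoIP",
    "scheduling": {n: {"policy": "FIFO", "queues_per_port": 1} for n in nodes},
}


def check_dti_input(msg):
    """Check that both matrices are N*N and that every path follows links."""
    names = msg["topology"]["nodes"]
    size = len(names)
    links = {(l["src"], l["dst"]) for l in msg["topology"]["links"]}
    for key in ("routing", "traffic_demands_bps"):
        matrix = msg[key]
        if len(matrix) != size or any(len(row) != size for row in matrix):
            return False
    for i, row in enumerate(msg["routing"]):
        for j, path in enumerate(row):
            if path is None:
                continue
            # A path must start at N(i), end at N(j) and use existing links.
            if path[0] != names[i] or path[-1] != names[j]:
                return False
            if any(hop not in links for hop in zip(path, path[1:])):
                return False
    return True


encoded = json.dumps(dti_input)  # what would be sent over the DTI
```

A consistency check of this kind is one possible way to reject malformed scenarios before they reach the NPDT; the outputs could be encoded analogously as three N*N matrices for delay, jitter and loss.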

5. Mapping to the Network Digital Twin Architecture

Since the NPDT is a type of Network Digital Twin, its elements can be mapped to the reference architecture of a NDT described in [I-D.draft-zhou-nmrg-digitaltwin-network-concepts]. Table 1 maps the elements of the NDT reference architecture to those of the NPDT. Note that the Physical Network is the same for both architectures.

Table 1: Mapping of NDT reference architecture elements to the architecture of the Network Performance DT.

6. Use Cases

6.1. Network Operations and Management

6.1.1. Network planning

The size and traffic of networks have doubled every year [network-capacity]. To accommodate this growth in users and network applications, networks need periodic upgrades. For example, ISPs might want to increase certain link capacities or add new connections to alleviate the burden on the existing infrastructure. This is typically a cumbersome process that relies on expert knowledge. Furthermore, modern networks are becoming larger and more complex, which makes it harder for existing planning solutions to scale [planning-scalability].

Since the NPDT models large infrastructures and can produce accurate and fast performance estimates, it can help in different tasks related to network capacity and planning:

* Estimating when an existing network will run out of resources, assuming a given growth in users.

* Using performance estimates to plan the optimal upgrade that can cope with user growth. Network operators can leverage the NPDT to make better planning decisions and anticipate network upgrades.

* Find unconventional topologies: in some networking scenarios, especially datacenter networks, some topologies are well-known to offer high performance [Google-Clos]. However, it is also possible to search for new topologies that optimize performance with the help of algorithms. On one hand, the algorithm explores different topologies and, on the other hand, the NPDT provides fast performance estimations to the algorithm. Hence, the NPDT guides the optimization algorithm towards the topologies with better performance [auto-dc-topology].

6.1.2. What-if scenarios

The NPDT is a unique tool to perform what-if analysis, that is, analyze the impact of potential scenarios and configurations safely without any impact on the real network. In this context, the NPDT acts as a safe sandbox where different configurations are applied to the NPDT to understand their impact on the network. Some examples of What-if analysis are:

* What is the impact on my network performance if we acquire company ACME and we incorporate all its employees?

* When will the network run out of capacity if we have an organic growth of users?

* What is the optimal network hardware upgrade given a budget?

* We need to update this path. What is the impact on the performance of the other flows?

* A particular day has a spike of 10% in traffic intensity. How much loss will it introduce? Can we reduce this loss if we rate-limit another flow?

* How many links can fail until the SLA is degraded?

* What happens if link B fails? Is the network able to process the current traffic load?
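As a sketch of how a link-failure question like the last one could be prepared for the NPDT, the following Python fragment lists the source-destination pairs whose configured path crosses a failed link. The routing table layout (a dictionary keyed by source-destination pair) and the function name paths_using_link are assumptions made for illustration; they are not part of any interface defined in this document.

```python
def paths_using_link(routing, link):
    """Return the (src, dst) pairs whose path crosses `link`.

    `routing` maps (src, dst) to a list of nodes; `link` is a pair of
    adjacent nodes and is matched in both directions.
    """
    a, b = link
    affected = []
    for (src, dst), path in routing.items():
        hops = set(zip(path, path[1:]))  # consecutive node pairs on the path
        if (a, b) in hops or (b, a) in hops:
            affected.append((src, dst))
    return sorted(affected)


# Hypothetical routing configuration for a small network.
routing = {
    ("A", "B"): ["A", "B"],
    ("A", "C"): ["A", "B", "C"],
    ("C", "D"): ["C", "D"],
}
affected = paths_using_link(routing, ("B", "C"))
print(affected)
```

Only the affected pairs need new paths; the modified scenario (topology without the failed link, updated routing) can then be submitted to the NPDT to estimate the resulting delay, jitter and loss.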

6.1.3. Troubleshooting

There are many factors that cause network failures (e.g., invalid network configurations, unexpected protocol interactions). Debugging modern networks is complex and time consuming. Currently, troubleshooting is typically done by human experts with years of experience using networking tools.

Network operators can leverage a NPDT to reproduce previous network failures, in order to find the source of service disruptions. Specifically, network operators can replicate past network failure scenarios and analyze their impact on network performance, making it easier to find specific configuration errors. In addition, the NPDT helps in finding more robust network configurations that prevent service disruptions in the future.

6.1.4. Anomaly detection

Since the NPDT models the behaviour of a real-world network, network operators have access to an estimation of the expected network behaviour. When the real-world network behaviour deviates from the NPDT’s behaviour, it can act as an indicator of an anomaly in the real-world network. Such anomalies can appear at different places in a network (e.g., core, edge, IoT), and different data sources can be used to detect such anomalies.
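The comparison between expected and observed behaviour can be sketched as follows. This Python fragment is a minimal illustration that assumes per-path delay values are available both from the NPDT and from the Measurement Interface; the function name find_anomalies and the 20% relative tolerance are hypothetical choices, not values taken from this document.

```python
def find_anomalies(predicted_delay, measured_delay, tolerance=0.2):
    """Flag paths whose measured delay deviates from the NPDT estimate.

    Both arguments map a path identifier to a delay in seconds. A path is
    flagged when the relative deviation exceeds `tolerance`.
    """
    anomalies = []
    for path, expected in predicted_delay.items():
        observed = measured_delay.get(path)
        if observed is None or expected <= 0:
            continue  # nothing to compare against
        deviation = abs(observed - expected) / expected
        if deviation > tolerance:
            anomalies.append((path, deviation))
    # Largest deviation first.
    return sorted(anomalies, key=lambda item: item[1], reverse=True)


predicted = {("A", "B"): 0.010, ("A", "C"): 0.020}
measured = {("A", "B"): 0.011, ("A", "C"): 0.030}
flagged = find_anomalies(predicted, measured)
```

In this example only the path from A to C is flagged, because its measured delay is about 50% above the estimate. In practice the tolerance would have to account for the prediction error of the NPDT itself.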

6.1.5. Training

As discussed before, the NPDT can be understood as a safe playground where misconfigurations don’t affect the real-world system performance. In this context, the NPDT can play an important role in improving the education and certification process of network professionals, both in basic networking training and advanced scenarios. For example:

* In basic network training, understand how routing modifications impact delay.

* In more advanced studies, showcase the impact of the scheduling configuration on flow performance, and how to use it to optimize SLAs.

* In cybersecurity scenarios, evaluate the effects of network attacks and possible counter-measures.

6.2. Network Optimization

Since the DT can provide performance estimates in short timescales, it is possible to pair it with a network optimizer (Figure 2). The network administrator defines one or more optimization objectives, e.g. a maximum average delay for all paths in the network. The optimizer can be implemented with a classical optimization algorithm, like Constraint Programming [DEFO] or Local Search [LS], or with a Machine Learning one, such as Deep Neural Networks [DNN-TM] or Multi-Agent Reinforcement Learning [MARL-TE]. Regardless of the implementation, the optimizer tests various configurations to find the network configuration parameters that satisfy the optimization objectives. In order to know the performance of a specific network configuration, the optimizer sends that configuration to the NPDT, which predicts its performance metrics.

                   +------------+     Candidate      +-------------+
                   |            |  Network Config.   |   Network   |
Optimization---->  |  Network   |------------------->| Performance |
 objectives        | Optimizer  |                    |   Digital   |
                   |            |<-------------------|    Twin     |
                   +------------+     Estimated      +-------------+
                         |            Performance
                         |
                         |
                         v
          Optimized Network Configuration

Figure 2: Using a NPDT as a network model for an optimizer.
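The interaction in Figure 2 can be sketched in a few lines of Python. This is an illustration under strong assumptions, not an implementation of any optimizer cited in this document: the function toy_npdt is a hypothetical stand-in for the NPDT that models two parallel links as M/M/1 queues, the configuration is a single traffic split ratio, and the optimizer is a plain exhaustive search.

```python
def toy_npdt(split, demand_bps=6e8, capacity_bps=1e9, packet_bits=12000):
    """Hypothetical stand-in for the NPDT.

    `split` is the fraction of the demand sent over the first of two
    parallel links. Returns the traffic-weighted mean delay in seconds,
    modelling each link as an M/M/1 queue (an assumption of this sketch).
    """
    mean_delay = 0.0
    for share in (split, 1.0 - split):
        load_bps = share * demand_bps
        if load_bps >= capacity_bps:
            return float("inf")  # overloaded link
        mean_delay += share * packet_bits / (capacity_bps - load_bps)
    return mean_delay


def optimize(candidates, predict, max_delay_s):
    """Query the twin for every candidate and keep the best feasible one."""
    best_config, best_delay = None, float("inf")
    for config in candidates:
        delay = predict(config)  # candidate config -> estimated performance
        if delay <= max_delay_s and delay < best_delay:
            best_config, best_delay = config, delay
    return best_config, best_delay


candidates = [0.0, 0.25, 0.5, 0.75, 1.0]
best_split, best_delay = optimize(candidates, toy_npdt, max_delay_s=25e-6)
print(best_split, best_delay)
```

Because the optimizer only calls predict(), the search strategy and the NPDT implementation can be replaced independently, which is the property the architecture relies on.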

An example of optimization use case would be multi-objective optimization scenarios: commonly, the network administrator defines a set of optimization goals that must be concurrently met [DEFO], for example:

* Bound the latency of all links to a maximum.

* Do not exceed a link utilization of 80% on a specific subset of the links.

* Route all flows of type B through node 10.

* Avoid more than 35 Gbps of traffic to router R5.

* Minimize the routing cost, that is, the number of flows to re-route [ReRoute-Cost].

7. Implementation Challenges

This section presents different technologies that can be used to build a NPDT, and details the advantages and disadvantages of using them to implement a NPDT. It takes into account how they perform with respect to the requirements of accuracy, speed, and scale of the NPDT predictions.

7.1. Simulation

Packet-level simulators, such as OMNET++ [OMNET] and NS-3 [ns-3], simulate network events. In a nutshell, they simulate the operation of a network by processing a series of events, such as the transmission of a packet, the enqueuing and dequeuing of packets in a router, etc. Hence, they offer excellent accuracy when predicting network performance metrics (delay, jitter and loss), but they take a significant amount of time to run: the simulation time scales linearly with the number of packets to simulate.

In fact, the simulation time depends on the number of events to process [limitations-net-sim]. This limits the scalability of simulators, even if the topology does not change: increasing traffic intensities will take longer to simulate because more packets enter the network per unit of time. Likewise, simulating the same traffic intensity in larger topologies will also increase the simulation time. For example, consider a simulator that takes 11 hours to process 4 billion events (these values are obtained from an actual simulation). Although 4 billion events may appear to be a large figure, consider:

* A 1 Gbps ethernet link, transmitting regular frames with the maximum of 1518 bytes.

* This translates to approx. 82k packets crossing the link per second.

* Assuming a network with 50 links, and that the transmission of a packet over a link equals a single event in the simulator, such a network translates to 82k packets/s/link * 50 links * 1 event/packet ≈ 4 million events to simulate one second of network activity.

* Then, with a budget of 4 billion events, it takes 11 hours to simulate only 16 minutes of network activity.
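The arithmetic above can be reproduced with the short Python sketch below. It uses only the figures given in the list (1 Gbps links, 1518-byte frames, 50 links, one event per packet per link, a budget of 4 billion events processed in 11 hours); like the text, it ignores the Ethernet preamble and inter-frame gap, and the variable names are illustrative.

```python
LINK_RATE_BPS = 1e9      # 1 Gbps Ethernet link
FRAME_BYTES = 1518       # maximum regular frame size
LINKS = 50
EVENTS_PER_PACKET = 1    # one event per packet transmission on a link
EVENT_BUDGET = 4e9       # events processed by the simulator
WALL_CLOCK_HOURS = 11    # time the simulator needs for that budget

# Packets crossing one fully loaded link per second (about 82k).
packets_per_second = LINK_RATE_BPS / (FRAME_BYTES * 8)

# Events needed to simulate one second of network activity (about 4 million).
events_per_second = packets_per_second * LINKS * EVENTS_PER_PACKET

# Network time covered by the event budget (about 16 minutes).
simulated_minutes = EVENT_BUDGET / events_per_second / 60

# How much slower than real time the simulation runs.
slowdown = WALL_CLOCK_HOURS * 60 / simulated_minutes

print(f"{packets_per_second:.0f} packets/s per link")
print(f"{events_per_second:.0f} events per simulated second")
print(f"{simulated_minutes:.1f} simulated minutes, slowdown x{slowdown:.0f}")
```

Under these assumptions the simulator runs roughly 40 times slower than the network it models, and the gap widens with more links or higher link rates.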

These figures show that, despite the high accuracy of network simulators, they take too much time to calculate performance estimations.

7.2. Emulation

Network emulators run the original network software in a virtualized environment. This makes them easy to deploy, and depending on the emulation hardware, they can produce reasonably fast estimations. However, for large scale networks their speed will eventually decrease because they are not using specific hardware built for networking. For fully-virtualized networks, emulating a network requires as many resources as the real one, which is not cost-effective.

In addition, some studies have reported variable accuracy depending on the emulation conditions, both the parameters and underlying hardware and OS configurations [emulation-perf]. Hence, emulators show some limitations if we want to build a fast and scalable NPDT. However, emulators are useful in other use cases, for example in training, debugging, or testing new features.

7.3. Analytical Modelling

Queueing Theory (QT) is an analytical tool that models computer networks as a series of interconnected queues that are evaluated analytically. It is arguably the most popular modeling technique and a well-established framework that can model complex and large networks. The key advantage of QT is its speed, because the calculations rely on mathematical equations.

However, the main limitation of QT is the traffic model: although it offers high accuracy for Poisson traffic models, it presents poor accuracy under realistic traffic models [qt-precision]. Internet traffic has been extensively analyzed in the past two decades, and although the community has not agreed on a universal model, there is consensus that, in general, aggregated traffic shows strong autocorrelation and heavy tails [inet-traffic].
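To illustrate why analytical models are fast, the following Python sketch evaluates the textbook M/M/1 result, in which the mean time a packet spends in a queue with Poisson arrivals at rate lambda and exponential service at rate mu is 1 / (mu - lambda). The choice of M/M/1, the function names, and the treatment of a path as a sum of independent per-link delays are assumptions made for this example; they are precisely the kind of simplifications that lose accuracy under realistic traffic.

```python
def mm1_mean_delay(arrival_rate_pps, link_rate_bps, mean_packet_bits):
    """Mean time in an M/M/1 queue (waiting plus transmission), in seconds."""
    service_rate_pps = link_rate_bps / mean_packet_bits
    if arrival_rate_pps >= service_rate_pps:
        raise ValueError("unstable queue: arrival rate >= service rate")
    return 1.0 / (service_rate_pps - arrival_rate_pps)


def path_delay(arrival_rates_pps, link_rate_bps, mean_packet_bits):
    """Delay of a path as the sum of per-link M/M/1 delays.

    Treating the links as independent queues is an assumption of this sketch.
    """
    return sum(
        mm1_mean_delay(rate, link_rate_bps, mean_packet_bits)
        for rate in arrival_rates_pps
    )


# One 1 Gbps link, 10,000-bit packets, 50,000 packets/s offered load.
single_link = mm1_mean_delay(50_000, 1e9, 10_000)
two_hops = path_delay([50_000, 50_000], 1e9, 10_000)
```

Each estimate costs a handful of arithmetic operations regardless of how many packets the traffic contains, in contrast with the per-event cost of a simulator.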

7.4. Neural Networks

Finally, Neural Networks (NN) and other Machine Learning (ML) tools are as fast as QT (in the order of milliseconds) and can provide accuracy similar to that of packet-level simulators. They represent an interesting alternative, but have two key limitations. First, they require training the NN with a large amount of data from a wide range of network scenarios: different routings, topologies, and scheduling configurations, as well as link failures and network congestion. This dataset may not always be accessible or easy to produce in a production network (see Section 8). Second, not all NNs keep sufficient accuracy when scaling to larger topologies; therefore, some use cases need custom NN architectures.

7.4.1. MultiLayer Perceptron

A MultiLayer Perceptron [MLP] is a basic kind of NN from the family of feedforward NNs. In short, input data is propagated unidirectionally from the input layer of neurons to the output layer. There may be an arbitrary number of hidden layers between the input and output layers. MLPs are widely used for basic ML applications, such as regression.

7.4.2. Recurrent Neural Networks

Recurrent Neural Networks [RNN] are a more advanced type of NN because they connect some layers to the previous ones, which gives them the ability to store state. They are mostly used to process sequential data, such as handwriting, text, or audio. They have been used extensively in speech processing [RNN-speech], and in general, Natural Language Processing applications [NLP].

7.4.3. Convolutional Neural Networks

Convolutional Neural Networks (CNN) are a type of Deep Learning NN designed to process structured arrays of data such as images. CNNs are highly performant when detecting patterns in the input data. This makes them widely used in computer vision tasks, where they have become the state of the art for many visual applications, such as image classification [CNN-images]. Since computer networks are not naturally represented as such arrays, the current design of CNNs presents limited applicability to computer networks.

7.4.4. Graph Neural Networks

Graph Neural Networks [GNN] are a type of neural network designed to work with graph-structured data. A relevant type of GNN with interesting characteristics for computer networks is the Message Passing Neural Network (MPNN). In a nutshell, an MPNN exchanges a set of messages between the graph nodes in order to learn the relationship between the input graph and the expected outputs of the training dataset. MPNNs are composed of three functions that are repeated for several iterations, depending on the size of the graph:

* Message: encodes information about the relationship of two contiguous elements of the graph in a message (an n-element array).

* Aggregation: combines the different messages received on a particular node. It is typically an element-wise summation. The result is an array of constant length, independently of the number of received messages.

* Update: combines the hidden states of a node with the aggregated message. The result of this function is used as input to the next message-passing iteration.

Note that the internal architecture of a MPNN is rebuilt for each input graph.
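The three functions can be sketched in plain Python as follows. This is a toy illustration of the message-passing structure only: in an actual MPNN the message and update functions are neural networks whose weights are learned during training, whereas here they are fixed arithmetic operations chosen for readability. All function names are illustrative.

```python
def message(h_sender, h_receiver):
    """Message function: here it simply forwards the sender's hidden state."""
    return list(h_sender)


def aggregate(messages, size):
    """Element-wise sum; the result has length `size` for any inbox size."""
    total = [0.0] * size
    for msg in messages:
        for k, value in enumerate(msg):
            total[k] += value
    return total


def update(h, aggregated):
    """Update function: average of the old state and the aggregated message."""
    return [0.5 * a + 0.5 * b for a, b in zip(h, aggregated)]


def message_passing(neighbours, hidden, iterations):
    """Run `iterations` rounds over the graph described by `neighbours`."""
    size = len(next(iter(hidden.values())))
    for _ in range(iterations):
        new_hidden = {}
        for node, state in hidden.items():
            inbox = [message(hidden[nb], state) for nb in neighbours[node]]
            new_hidden[node] = update(state, aggregate(inbox, size))
        hidden = new_hidden  # the result feeds the next iteration
    return hidden


# Line graph A - B - C with 2-element hidden states.
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
hidden = {"A": [1.0, 0.0], "B": [0.0, 0.0], "C": [0.0, 1.0]}
after_one = message_passing(neighbours, hidden, iterations=1)
```

The loop structure follows the `neighbours` dictionary of whatever graph is supplied, which is what allows the same three functions to be applied to topologies of different shapes and sizes.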

This ability to understand graph-structured data naturally renders them interesting for a Network Performance Digital Twin. Since computer networks are fundamentally graphs, GNNs have the potential to take as input a graph of the network and produce as output performance estimations of the input network [qt-precision].

7.4.5. NN Comparison

Figure 3 presents a comparison of different types of NN that predict the delay of a given input network. We use a dataset of the performance of different network topologies, created with simulation data (i.e., ground truth) from OMNET++. We measure the error relative to the delay of the simulation data. In order to evaluate how well the different NNs deal with different network topologies, we train each NN in three different scenarios:

* Same topology: the training and testing datasets contain the same network topologies.

* Different topology: the training and testing datasets contain different sets of network topologies. The objective is determining if the NN keeps the same performance if we show it a topology it has never seen.

* Link failures: here we remove a random link from the topology.

+----------------------------------------------------------+
|  Mean Average Percentage Error of the delay prediction   |
+----------------------+-----------------------------------+
| Scenario             |    MLP    |    RNN    |    GNN    |
+----------------------+-----------+-----------+-----------+
| Same topology        |   0.123   |   0.1     |   0.020   |
| Different topology   |  11.5     |   0.305   |   0.019   |
| Link failures        |   1.15    |   0.638   |   0.042   |
+----------------------+-----------+-----------+-----------+

Figure 3: Performance comparison of different NN architectures

The values in Figure 3 are fractions, so 0.020 corresponds to an error of 2%. We can see that all NNs predict the network delay with good accuracy if we don’t change the topology used during training (errors between 2% and 12.3%). However, when it comes to new topologies, the error of the MLP is unacceptable (1150%), and that of the RNN is also high (around 30%). On the other hand, the GNN can understand new topologies, with an error below 2%. Similarly, if a link fails, the MLP (115% error) and the RNN (about 64% error) have difficulties offering accurate predictions, while the GNN maintains its accuracy (4.2% error). These results show the potential of GNNs to build a Network Performance Digital Twin.
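For reference, an error measured relative to the simulated delay can be computed as in the Python sketch below. It assumes that the metric in Figure 3 follows the usual definition of the mean absolute percentage error, expressed as a fraction; this document does not give the exact formula, and the function name and sample values are illustrative.

```python
def mean_relative_error(predicted, ground_truth):
    """Mean of |prediction - truth| / |truth| over all samples (a fraction)."""
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) / abs(t) for p, t in zip(predicted, ground_truth))
    return total / len(ground_truth)


# Hypothetical per-path delays in milliseconds.
simulated = [1.0, 2.0]   # ground truth from the simulator
estimated = [1.1, 1.8]   # NN predictions
error = mean_relative_error(estimated, simulated)
print(f"{error * 100:.1f}% mean relative error")
```

Both sample paths are off by 10%, so the resulting fraction is 0.1, which would be reported as 10%.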

8. Training

Digital Twins based on Machine Learning require a training process before they can be deployed. Commonly, the training process makes use of a dataset of inputs and expected outputs that guides the adjustment of the internal parameters of, e.g., the neural network. There are some caveats regarding the training process:

* In order to obtain sufficient accuracy, the training dataset needs to be representative, that is, contain samples of a wide range of possible inputs and outputs. In networks, this translates to samples of a congested network, with a link failure, etc. Otherwise, the resulting algorithm cannot predict such situations.

* Consequently, some kinds of samples, e.g. those of a congested or disrupted network, are difficult to obtain from a production network.

* A way to acquire those samples is in a testbed, although it may not be possible for some networks, especially those of large scale. A possible solution in this situation is developing Neural Networks that are invariant to some of the metrics of the graph, e.g. number of nodes. That is, the NN does not lose accuracy if the number of nodes increases. This makes it possible to train the NN in a testbed, and then deploy it in a network that is larger than the testbed without losing accuracy.

9. IANA Considerations

This memo includes no request to IANA.

10. Security Considerations

An attacker can alter the software image of the NPDT. This could produce inaccurate performance estimations, which could result in network misconfigurations, disruptions or outages. Hence, in order to prevent the accidental deployment of a malicious NPDT, the software image of the NPDT MUST be digitally signed by the vendor.

11. References

[OMNET] "https://omnetpp.org/", 2022.

[ns-3] "https://www.nsnam.org/", 2022.

[P4Rspec] "https://p4.org/p4-spec/p4runtime/main/P4Runtime-Spec.html", 2021.

[OFspec] "TS-025: OpenFlow Switch Specification https://opennetworking.org/wp-content/uploads/2014/10/openflow-switch-v1.5.1.pdf", 2015.

[NetworkXlib] "https://networkx.org/", 2022.

[openconfig-rtgwg-gnmi-spec-01] Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack, C., and C. Morrow, "gRPC Network Management Interface (gNMI)", March 2018, <https://datatracker.ietf.org/doc/html/draft-openconfig-rtgwg-gnmi-spec-01>.

[RFC8040] Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017, <https://www.rfc-editor.org/info/rfc8040>.

[RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., and A. Bierman, Ed., "Network Configuration Protocol (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011, <https://www.rfc-editor.org/info/rfc6241>.

[RFC6830] Farinacci, D., Fuller, V., Meyer, D., and D. Lewis, "The Locator/ID Separation Protocol (LISP)", RFC 6830, DOI 10.17487/RFC6830, January 2013, <https://www.rfc-editor.org/info/rfc6830>.

[RFC4655] Farrel, A., Vasseur, J.-P., and J. Ash, "A Path Computation Element (PCE)-Based Architecture", RFC 4655, DOI 10.17487/RFC4655, August 2006, <https://www.rfc-editor.org/info/rfc4655>.

[RFC7047] Pfaff, B. and B. Davie, Ed., "The Open vSwitch Database Management Protocol", RFC 7047, DOI 10.17487/RFC7047, December 2013, <https://www.rfc-editor.org/info/rfc7047>.

[RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004, <https://www.rfc-editor.org/info/rfc3954>.

[I-D.draft-zhou-nmrg-digitaltwin-network-concepts] Zhou, C., Yang, H., Duana, X., Lopez, D., Pastor, A., Wu, Q., Boucadir, M., and C. Jacquenet, "Digital Twin Network: Concepts and Reference Architecture", Work in Progress, Internet-Draft, draft-zhou-nmrg-digitaltwin-network-concepts-06, 2 December 2021, <https://datatracker.ietf.org/doc/html/draft-zhou-nmrg-digitaltwin-network-concepts-06>.

[irtf-nmrg-ibn-concepts-definitions-09] Clemm, A., Ciavaglia, L., Granville, L. Z., and J. Tantsura, "Intent-Based Networking - Concepts and Definitions", March 2022, <https://datatracker.ietf.org/doc/html/draft-irtf-nmrg-ibn-concepts-definitions-09>.

[digital-twin-5G] Nguyen, H. X., Trestian, R., To, D., and M. Tatipamula, "Digital Twin for 5G and Beyond", 2021, <https://doi.org/10.1109/MCOM.001.2000343>.

[digital-twin-vanets] Zhao, L., Han, G., Li, Z., and L. Shu, "Intelligent Digital Twin-Based Software-Defined Vehicular Networks", 2020, <https://doi.org/10.1109/MNET.011.1900587>.

[digital-twin-industry] Groshev, M., Guimarães, C., Martín-Pérez, J., and A. D. L. Oliva, "Toward Intelligent Cyber-Physical Systems: Digital Twin Meets Artificial Intelligence", 2021, <https://doi.org/10.1109/MCOM.001.2001237>.

[streaming-telemetry] Gupta, A., Harrison, R., Canini, M., Feamster, N., Rexford, J., and W. Willinger, "Sonata: Query-Driven Streaming Network Telemetry", 2018, <https://doi.org/10.1145/3230543.3230555>.

[network-capacity] Ellis, A. D., Suibhne, N. M., Saad, D., and D. N. Payne, "Communication networks beyond the capacity crunch", 2016, <https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2015.0191>.

[planning-scalability] Zhu, H., Gupta, V., Ahuja, S. S., Tian, Y., Zhang, Y., and X. Jin, "Network Planning with Deep Reinforcement Learning", 2021, <https://doi.org/10.1145/3452296.3472902>.

[limitations-net-sim] Rampfl, S., "Network simulation and its limitations", 2013, <https://doi.org/10.2313/NET-2013-08-1_08>.

[emulation-perf] Jurgelionis, A., Laulajainen, J., Hirvonen, M., and A. I. Wang, "An Empirical Study of NetEm Network Emulation Functionalities", 2011, <https://doi.org/10.1109/ICCCN.2011.6005933>.

[qt-precision] Ferriol-Galmés, M., Rusek, K., Suárez-Varela, J., Xiao, S., Cheng, X., Barlet-Ros, P., and A. Cabellos-Aparicio, "RouteNet-Erlang: A Graph Neural Network for Network Performance Evaluation", 2022, <https://arxiv.org/abs/2202.13956>.

[inet-traffic] Popoola, J. and R. Ipinyomi, "Empirical Performance of Weibull Self-Similar Tele-traffic Model", 2017.

[MLP] Pal, S. and S. Mitra, "Multilayer perceptron, fuzzy sets, and classification", 1992, <https://doi.org/10.1109/72.159058>.

[RNN] Hochreiter, S. and J. Schmidhuber, "Long Short-Term Memory", 1997, <https://doi.org/10.1162/neco.1997.9.8.1735>.

[RNN-speech] Mikolov, T., Kombrink, S., Burget, L., Černocký, J., and S. Khudanpur, "Extensions of recurrent neural network language model", 2011, <https://doi.org/10.1109/ICASSP.2011.5947611>.

[GNN] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and G. Monfardini, "The Graph Neural Network Model", 2009, <https://doi.org/10.1109/TNN.2008.2005605>.

[DEFO] Hartert, R., Vissicchio, S., Schaus, P., Bonaventure, O., Filsfils, C., Telkamp, T., and P. Francois, "A Declarative and Expressive Approach to Control Forwarding Paths in Carrier-Grade Networks", 2015, <https://doi.org/10.1145/2785956.2787495>.

[facebook-config] Sung, Y. E., Tie, X., Wong, S. H., and H. Zeng, "Robotron: Top-down Network Management at Facebook Scale", 2016, <https://doi.org/10.1145/2934872.2934874>.

[auto-dc-topology] Salman, S., Streiffer, C., Chen, H., Benson, T., and A. Kadav, "DeepConf: Automating Data Center Network Topologies Management with Machine Learning", 2018, <https://doi.org/10.1145/3229543.3229554>.

[CNN-images] Krizhevsky, A., Sutskever, I., and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", 2012, <https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf>.

[MARL-TE] Bernárdez, G., Suárez-Varela, J., López, A., Wu, B., Xiao, S., Cheng, X., Barlet-Ros, P., and A. Cabellos-Aparicio, "Is Machine Learning Ready for Traffic Engineering Optimization?", 2021, <https://doi.org/10.1109/ICNP52444.2021.9651930>.

[LS] Gay, S., Hartert, R., and S. Vissicchio, "Expect the unexpected: Sub-second optimization for segment routing", 2017, <https://doi.org/10.1109/INFOCOM.2017.8056971>.

[DNN-TM] Valadarsky, A., Schapira, M., Shahaf, D., and A. Tamar, "Learning to Route", 2017, <https://doi.org/10.1145/3152434.3152441>.

[ReRoute-Cost] Zheng, J., Xu, Y., Wang, L., Dai, H., and G. Chen, "Online Joint Optimization on Traffic Engineering and Network Update in Software-defined WANs", 2021, <https://doi.org/10.1109/INFOCOM42981.2021.9488837>.

[NLP] Chowdhary, K. R., "Natural Language Processing", 2020, <https://doi.org/10.1007/978-81-322-3972-7_19>.

[Google-Clos] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., Kanagala, A., Provost, J., Simmons, J., Tanda, E., Wanderer, J., Hölzle, U., Stuart, S., and A. Vahdat, "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network", 2015, <https://doi.org/10.1145/2785956.2787508>.

[digital-twin-AI] Mozo, A., Karamchandani, A., Gómez-Canaval, S., Sanz, M., Moreno, J. I., and A. Pastor, "B5GEMINI: AI-Driven Network Digital Twin", 2022, <https://www.mdpi.com/1424-8220/22/11/4106>.

Acknowledgements

Jordi Paillisse UPC-BarcelonaTech c/ Jordi Girona 1-3 08034 Barcelona Catalonia Spain Email: jordi.paillisse@upc.edu

Paul Almasan UPC-BarcelonaTech c/ Jordi Girona 1-3 08034 Barcelona Catalonia Spain Email: felician.paul.almasan@upc.edu

Miquel Ferriol UPC-BarcelonaTech c/ Jordi Girona 1-3 08034 Barcelona Catalonia Spain Email: miquel.ferriol@upc.edu

Pere Barlet UPC-BarcelonaTech c/ Jordi Girona 1-3 08034 Barcelona Catalonia Spain Email: pere.barlet@upc.edu

Albert Cabellos UPC-BarcelonaTech c/ Jordi Girona 1-3 08034 Barcelona Catalonia Spain Email: alberto.cabellos@upc.edu

Shihan Xiao Huawei China Email: xiaoshihan@huawei.com

Xiang Shi Huawei China Email: shixiang16@huawei.com

Xiangle Cheng Huawei China Email: chengxiangle1@huawei.com

Diego Perino Telefonica I+D Barcelona Spain Email: diego.perino@telefonica.com

Diego Lopez Telefonica I+D Seville Spain Email: diego.r.lopez@telefonica.com

Antonio Pastor Telefonica I+D Madrid Spain Email: antonio.pastorperales@telefonica.com

Internet Research Task Force D. Chen

Internet-Draft H. Yang

Intended status: Informational K. Yao

Expires: 7 January 2023 China Mobile

G. Fioccola

Q. Wu

Huawei

6 July 2022

Network measurement intent - one of IBN use cases

draft-yang-nmrg-network-measurement-intent-05

Abstract

As an important technical means of detecting network state, network measurement has attracted increasing attention as networks have developed. However, current network measurement technology suffers from a mismatch between the measurement method and the measurement purpose. To solve this problem, this memo introduces network measurement intent and presents a process for scheduling network resources and measurement tasks to meet the needs of users or network operators. It can be seen as a specific use case of intent-based networking.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

Chen, et al. Expires 7 January 2023 [Page 1]

Internet-Draft Network Working Group July 2022

Copyright Notice

Table of Contents

1. Introduction
2. Definitions and Acronyms
3. Relationship to Existing Documents
4. Overview
5. Concrete Examples
  5.1. Time Accuracy Measurement
  5.2. Spatial Accuracy Measurement
6. Classification of NMI
  6.1. Static NMI
  6.2. Dynamic NMI
7. Security Considerations
8. IANA Considerations
9. References
  9.1. Normative References
  9.2. Informative References
Authors’ Addresses

1. Introduction

With the rapid growth of networks, network scale keeps increasing while users’ service requirements are becoming stricter and more diversified, e.g., loss and throughput requirements need to be met simultaneously. At the same time, the growth of network resources struggles to keep up with users’ service requirements. To make good use of network resources and improve bandwidth utilization, it is necessary to understand the current running state of the network, and network measurement, as a technical means of detecting changes in network resources, has received more attention. The continuous development of network measurement technology has also increased the precision of network awareness. However, both traditional network measurement technology (e.g., the delay and loss measurements defined in RFC 2679 [RFC2679] and RFC 2680 [RFC2680]) and network telemetry technology (RFC 8639 [RFC8639], RFC 8641 [RFC8641], [I-D.ietf-netconf-adaptive-subscription]), which has emerged with the development of software-defined networking in recent years, consume additional network resources when detecting network state changes and feeding back the detection results. Therefore, the choice of network measurement method affects not only the accuracy of the measurement results but also the level of load imposed on the network.

In order to balance the accuracy of network measurement results with the network load, it is very important to choose the appropriate network measurement method according to the different requirements of network measurement. As a result, accurate on-demand network measurement technology is becoming more and more important. At the same time, the development of Intent based Network (IBN) enables the network to be configured according to users’ or network administrators’ intent. Therefore, we can integrate the network measurement with IBN, i.e., the users’ or network administrators’ perceived demand for network state is regarded as network measurement intent.

Our proposed approach is to use the network measurement intent to acquire network performance data based on user/network administrator intent, verify whether the network measurement results meet the measurement intent, and further improve the accuracy of the configuration in IBN.

2. Definitions and Acronyms

CLI: Command-line Interface.

IBN: Intent based Network.

Policy: A set of rules that governs the choices in behavior of a system.

NMI: Network Measurement Intent. An intent that expresses a user’s or network operator’s demand for network status, based on which network status information is automatically collected on demand.

SLA: Service Level Agreement.

3. Relationship to Existing Documents

With the rise of IBN, different groups have defined intent differently. For example, ONF [ONOS] represents an intent as a list of CLI modes that allow users to pass low-level details on the network. There are also two active RG drafts in the NMRG: "Intent-Based Networking - Concepts and Definitions" [I-D.irtf-nmrg-ibn-concepts-definitions] addresses the question "What is an intent?", and [I-D.irtf-nmrg-ibn-intent-classification] addresses the question "Given a specific intent, how to parse/disassemble it from different angles?".

Naturally, the question to be answered after concept definition is "How to realize a specific intent?". [I-D.irtf-nmrg-ibn-intent-classification] can be considered the first step in realizing a given intent; however, it is not enough. Other issues should be clarified, such as "Is the input intent valid?", "What should the IBN system do when the result is not acceptable?", and "If the result is not acceptable, is human/operator intervention required?". A specific IBN use case helps to illustrate the realization procedure, so this document takes the network measurement intent as an example.

Referring to the taxonomy of intent proposed in [I-D.irtf-nmrg-ibn-intent-classification], the network measurement intent can be classified into different categories.

Solution: the intent could cover carrier and data center.

Intent user type: customer.

Intent type: customer service intent.

Intent scope: Application, QoS.

Network scope: Radio Access, Transport, Edge, Core.

Abstraction: Non-technical.

Lifecycle Requirements: transient.

In order to integrate the NMI with the IBN, in this document we define the components of the NMI interactive process as follows:

* NMI Recognition and Acquisition

* NMI Translation

* NMI Policy

* NMI Orchestration and pre-Verification

* Data Collection and Analytics

* NMI Compliance Assessment

4. Overview

As mentioned above, NMI refers to the on-demand measurement of the network state based on the user’s or network operator’s intent regarding the network state. This intent is usually expressed as a service level objective or a service level expectation. We take the measurement of network performance when the network is busy, i.e., overwhelmed with traffic, as a simple example and present the detailed interactive process for the components defined in Section 3.

* NMI Recognition and Acquisition.

- In this function, the NMI is recognized by "ingesting" the measurement intent of users or network operators. The function identifies which network performance parameters users want to measure, such as delay, jitter, etc., and allows users to express the NMI in a variety of interactive ways to ensure that the NMI is identified accurately. This interaction requires the intent northbound interface defined in IBN, e.g., the service interface model in [RFC8299] [RFC8466] or the intent interface defined in [TMF1253A].

* NMI Translation.

- In this function, the NMI is translated into a corresponding measurement policy, which includes but is not limited to the network performance parameters to be measured (such as delay, jitter, and packet loss), the time period to be measured, and the measurement unit. As a simple example, when measuring the performance of a busy network, the busy period is not fixed because network characteristics, such as daily bandwidth utilization, change dynamically. As a result, the NMI Policy generated by NMI Translation can determine the threshold at which the network is considered busy or congested on a given day, based on historical data learned by AI.

* NMI Policy

- In this function, the NMI policy is translated into actions and instructions invoked against the specified network elements. Therefore, the NMI policy generated by NMI Translation must be executable, that is, the corresponding underlying network devices must be able to support policy execution. If the generated policy cannot be executed by the underlying devices, the policy needs to be adjusted. If the measurement results cannot meet the service requirements set by the users and network operators, the policy also needs to be adjusted.

* NMI Orchestration and pre-Verification.

- In this function, NMI Orchestration and pre-Verification determines the measurement scheme according to the measurement policy generated by NMI Policy, and pre-verifies whether the measurement scheme is feasible.

- Taking busy-time network measurement as an example, besides choosing measurement schemes and assigning measurement tasks [RFC8639] [RFC8641] [I-D.ietf-netconf-adaptive-subscription] [RFC8194] [I-D.ietf-netmod-eca-policy], this function also needs to determine whether the network is busy according to the current network state. In addition, this function performs automatic network deployment, e.g., using the model-driven network management approach defined in [RFC8969].

* Data Collection and Analytics.

- In NMI, data collection and analysis are based on the measurement scheme and the set of parameters to be measured, as determined in the previous steps. This function automatically collects data on demand and generates the corresponding data analysis results.

* NMI Compliance Assessment.

- Finally, this function verifies whether the results meet the service requirements and whether the NMI is satisfied. If either of the two conditions is not satisfied, the NMI should be modified and the process re-enters NMI Policy.
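As a rough illustration of how the functions above could fit together, the following Python sketch models the loop from NMI Translation to NMI Compliance Assessment. It is a minimal sketch under stated assumptions, not a definitive implementation: all names (MeasurementPolicy, translate, make_executable, run_nmi, fake_collect), the initial 10-second reporting interval, the 1-second device limit, and the rule of halving the interval when the intent is not met are hypothetical and are not defined by this document.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MeasurementPolicy:
    metric: str        # performance parameter to measure, e.g. "delay"
    interval_s: float  # reporting interval in seconds
    duration_s: float  # length of the measurement period in seconds


def translate(intent: dict) -> MeasurementPolicy:
    """NMI Translation: derive an initial policy from the recognized intent.

    The 10-second starting interval is an arbitrary assumption.
    """
    return MeasurementPolicy(intent["metric"], 10.0, intent["duration_s"])


def make_executable(policy: MeasurementPolicy,
                    min_interval_s: float) -> MeasurementPolicy:
    """NMI Policy: adjust the policy so that the device can execute it."""
    if policy.interval_s < min_interval_s:
        return replace(policy, interval_s=min_interval_s)
    return policy


def run_nmi(intent: dict, collect, min_interval_s: float = 1.0,
            max_rounds: int = 5):
    """Run the loop until the intent is satisfied or max_rounds is reached."""
    policy = translate(intent)
    for _ in range(max_rounds):
        policy = make_executable(policy, min_interval_s)
        samples = collect(policy)                  # Data Collection and Analytics
        if len(samples) >= intent["min_samples"]:  # NMI Compliance Assessment
            return policy, samples
        # Intent not satisfied: re-enter NMI Policy with a finer interval.
        policy = replace(policy, interval_s=policy.interval_s / 2)
    raise RuntimeError("intent could not be satisfied")


def fake_collect(policy: MeasurementPolicy) -> list:
    """Stand-in collector: one constant sample per reporting interval."""
    return [1.0] * int(policy.duration_s // policy.interval_s)
```

Calling run_nmi({"metric": "delay", "min_samples": 20, "duration_s": 60.0}, fake_collect) halves the interval until enough samples are collected within the measurement period. A real system would replace fake_collect with the selected measurement scheme and would use a richer compliance check than a sample count.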

The measurement flow is shown in the following figure:

5. Concrete Examples

In this section, we will take SLA measurement intent as an example to illustrate each step of the process.

With the development of measurement technology in recent years, network measurement methods can be divided into active measurement, passive measurement, and hybrid measurement [RFC7799]. No matter which measurement technology is used, the network resource consumption is influenced by network conditions and changes over time. For example, if active measurement messages are sent too frequently, they occupy too much bandwidth and affect the normal operation of actual business. If they are sent too infrequently, some instantaneous network anomalies are missed and the network status cannot be accurately reflected. Passive measurement requires real-time collection of actual business data. If the sampling rate is too high, a large amount of data accumulates in a short time [I-D.ietf-netconf-adaptive-subscription], and the system that analyzes these data in real time needs strong processing capacity; if the sampling rate is too low, some network anomalies are also missed.

The function that the NMI scheme based on IBN should realize is to measure the network state accurately, especially abnormal conditions that affect the service, while occupying as little network bandwidth as possible and keeping the processing demands on the data analysis system low.

In this section, we will consider two examples to illustrate each step of the process.

5.1. Time Accuracy Measurement

Taking delay measurement, a network SLA performance metric, as an example, two thresholds, a warning value and an alert value, should be set for network delay in advance. When the delay is below the warning value, both the network and the business are normal. When the delay is between the warning value and the alert value, the network fluctuation is abnormal, but the business is normal. When the delay exceeds the alert value, both the network and the business are abnormal. Different measurement strategies should be adopted for the different delay ranges:

* When the network delay exceeds the alert value, or when the historical data predict that the delay will exceed the alert value, passive measurement requires 100% sampling of business data, and the transmission frequency of active measurement is modulated to the maximum. At the same time, the log and alarm data of the whole network equipment are collected to realize the most fine-grained measurement of the network, locate the root cause of the problem and repair the network in time.

* When the network delay exceeds warning value but is lower than alert value, passive measurement samples 60% of business data, and the transmission message frequency of the active measurement is adjusted to the median value, and the running state data of some key devices in the network is collected synchronously.

* When the network delay is less than warning value, passive measurement data is sampled at 20%, and active measurement message frequency is adjusted to the lowest, and the network equipment running state of key nodes can be collected as needed.
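The three cases above can be sketched as a small selection function. This is an illustrative sketch only: the names Strategy and select_strategy, the string labels, and the handling of delays exactly equal to a threshold (treated here as belonging to the lower tier) are assumptions and are not specified by this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    passive_sampling_pct: int  # share of business data sampled passively
    active_frequency: str      # "max", "median" or "min" probe frequency
    device_scope: str          # which devices report logs or running state


def select_strategy(delay_ms: float, warning_ms: float, alert_ms: float,
                    predicted_ms: Optional[float] = None) -> Strategy:
    """Pick a measurement strategy from the current and predicted delay."""
    if warning_ms >= alert_ms:
        raise ValueError("warning value must be lower than alert value")
    # A predicted delay above the alert value is treated like an actual one.
    worst = delay_ms if predicted_ms is None else max(delay_ms, predicted_ms)
    if worst > alert_ms:
        return Strategy(100, "max", "all devices (logs and alarms)")
    if delay_ms > warning_ms:
        return Strategy(60, "median", "key devices")
    return Strategy(20, "min", "key nodes, as needed")
```

For example, with a warning value of 10 ms and an alert value of 50 ms, a measured delay of 30 ms selects 60% passive sampling, while a measured delay of 5 ms combined with a predicted delay of 60 ms already selects the most fine-grained strategy.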

Based on the above SLA delay measurement, in which different delay ranges trigger different measurement strategies, the concrete steps of the SLA measurement intent are as follows:

* In NMI Recognition and Acquisition, SLA measurement intent is recognized, and business requirements and performance metrics are identified by interacting with users. Then the NMI Recognition and Acquisition module inputs the SLA measurement intent into the NMI Translation module.

* The NMI Translation module consolidates the SLA measurement intent with the measurement policy in NMI Policy, and outputs the executable measurement policy, such as the message transmission frequency of active measurement, the sampling rate of passive measurement, the collection range of equipment running state, etc.

* The NMI Orchestration and pre-Verification module takes the measurement policy as input and acts as the orchestration layer, which is responsible for translating the policy into the specific configuration and execution time of each device in the tested network. The module also verifies the implementation of the policy in the equipment and pre-analyzes the measurement results.

* The Data Collection and Analysis module collects the measurement data according to the configuration and execution time requirements of the previous step, performs a simple analysis of the collected data (e.g., verifies the correctness of the measurement data), and then sends the collected measurement data to the NMI Compliance Assessment module. After that, the NMI Compliance Assessment module feeds the measurement results (e.g., whether the measurement results match the user intent) back to the user to complete the closed loop of the measurement task.

* The NMI Compliance Assessment module evaluates whether the actual measurement results are in line with the user’s intent. If they are, the results will be fed back. If they are not, the NMI Policy module will be informed to adjust the policy, and then the measurement will be restarted. According to the measurement results, the NMI Compliance Assessment module notifies the NMI Orchestration and pre-Verification module to modify the execution time of the policy in time, and at the same time updates the measured results to the delay history database to improve the accuracy of delay prediction.

5.2. Spatial Accuracy Measurement

The desired approach is to accurately measure the network state, especially when there are some issues affecting the service, but at the same time, reduce the resources to be employed to achieve the desired accuracy.

In this regard, the Clustered Alternate-Marking framework [RFC8889] adds flexibility to Performance Measurement (PM), because it can reduce the order of magnitude of the packet counters. This allows the NMI Orchestration and pre-Verification module to supervise, control, and manage PM in large networks.

[RFC8889] introduces the concept of cluster partition of a network. The monitored network can be considered as a whole or split into clusters that are the smallest subnetworks (group-to-group segments), maintaining the packet loss property for each subnetwork. The clusters can be combined in new connected subnetworks at different levels, forming new clusters, depending on the level of detail to achieve.

The clustered performance measurement intent represents the spatial accuracy, that is, the size of the subnetworks to consider for monitoring. It is possible to start without examining the network in depth and, if necessary, use the "network zooming" approach.

Network zooming can be performed in two different ways:

1. change the traffic filter and select more detailed flows;

2. activate new measurement points by defining more specified clusters.

The network-zooming approach implies that some filters, rules, or flow identifiers are changed. These changes must be made in a way that does not affect performance. Therefore, there could be a transient time to wait once the new network configuration takes effect. However, if the performance issue is relevant, it is likely to last much longer than the transient time.

The concrete steps of the clustered performance measurement intent are as follows:

* In NMI Recognition and Acquisition, the clustered performance measurement intent is recognized. Then the NMI Recognition and Acquisition module inputs the clustered performance measurement intent into the NMI Translation module.

* The NMI Translation module analyzes the clustered performance measurement intent and outputs the executable measurement policy, such as network partition and the spatial accuracy for the monitoring.

* The NMI Orchestration and pre-Verification module arranges and calibrates the measurement with the specific configuration to split the whole network into clusters at different levels.

* The Data Collection and Analysis module collects the measurement data from the different clusters and then sends these data to the NMI Compliance Assessment module, which verifies the performance of each cluster and sends the measurement results to the user.

* The NMI Compliance Assessment module, in case a cluster is experiencing a packet loss or the delay is high, notifies the NMI Orchestration and pre-Verification module to modify the cluster partition of the network for further investigation. The network configuration can be immediately modified in order to perform a new partition of the network but only for the cluster with bad performance. In this way, the problem can be localized with successive approximation up to a flow detailed analysis. This is the so-called "closed loop" performance management.
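The successive-approximation behavior described in the steps above can be sketched as a recursive search over a cluster hierarchy. The sketch below is an illustration under stated assumptions: the function name zoom, the representation of the partition as a mapping from a cluster to its sub-clusters, the loss_of callback, and the single loss threshold are all hypothetical, and the sketch is not the method specified in [RFC8889].

```python
from typing import Callable, Dict, List


def zoom(cluster: str,
         partition: Dict[str, List[str]],
         loss_of: Callable[[str], float],
         threshold: float) -> List[str]:
    """Return the smallest clusters whose packet loss exceeds the threshold.

    Only clusters with bad performance are split further, so healthy parts
    of the network are never measured at a finer level.
    """
    if loss_of(cluster) <= threshold:
        return []              # healthy cluster: no need to zoom in
    children = partition.get(cluster, [])
    if not children:
        return [cluster]       # cannot split further: problem localized
    bad: List[str] = []
    for child in children:
        bad.extend(zoom(child, partition, loss_of, threshold))
    # If no sub-cluster exceeds the threshold on its own, report the parent.
    return bad or [cluster]
```

For instance, with partition = {"network": ["A", "B"], "A": ["A1", "A2"]} and measured loss ratios of 0.02 for the whole network, 0.04 for A, 0.08 for A2, and 0 elsewhere, a threshold of 0.01 localizes the problem to cluster A2 without ever splitting cluster B.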

6. Classification of NMI

In this section, we divide the network measurement intent into static NMI and dynamic NMI according to different requirement characteristics.

6.1. Static NMI

Static NMI refers to an intent whose measurement purpose remains unchanged and is independent of the network state/external environment. Static NMI can be translated into determined network performance indicator values, such as concrete delay values, network bandwidth utilization, throughput, and so on.

Because the static NMI can be translated into the measurement of determined network performance parameters, the whole process is relatively simple and less error-prone, and it is only necessary to verify whether the measurement results meet the requirements.

6.2. Dynamic NMI

Dynamic NMI refers to an intent whose measurement purpose remains unchanged but whose measurement process changes dynamically according to the network state/external environment. Dynamic NMI can also be translated into the measurement of determined network performance parameters; however, the values of these parameters change with the network state and external environment.

An example is the measurement of busy network performance mentioned earlier in this document. Although the network parameters used to judge whether the network is busy are determined, their values differ according to the network state and external environment.

Due to its dynamic nature, dynamic NMI is more complex to process than static NMI. It is necessary to verify not only the accuracy of the demand analysis but also whether the final measurement results meet the requirements.
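To make the difference from static NMI concrete, the following sketch derives the "busy" threshold from historical utilization instead of using a fixed value. It is only a sketch: this document refers to thresholds learned by AI from historical data, and the simple quantile used here is a stand-in chosen for illustration, as are the names busy_threshold and is_busy and the default quantile of 0.9.

```python
from typing import Sequence


def busy_threshold(history: Sequence[float], quantile: float = 0.9) -> float:
    """Return the utilization value at the given quantile of the history."""
    if not history:
        raise ValueError("history must not be empty")
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    ordered = sorted(history)
    # Clamp the index so that quantile == 1.0 selects the largest value.
    index = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[index]


def is_busy(utilization: float, history: Sequence[float],
            quantile: float = 0.9) -> bool:
    """Judge whether the network is busy relative to its own history."""
    return utilization >= busy_threshold(history, quantile)
```

With this approach the same utilization value can be judged busy against one history and not busy against another, which is the behavior that distinguishes dynamic NMI from static NMI.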

This document introduces the network measurement intent, and uses two concrete examples to illustrate the process of network measurement intent. On the basis of existing intent work, this document can be used as a use case for IBN.

7. Security Considerations

[I-D.irtf-nmrg-ibn-concepts-definitions] provides a comprehensive discussion of security considerations in the context of IBN, which are generally applicable also to the network measurement intent discussed in this document.

9. References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

[RFC2679] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way Delay Metric for IPPM", RFC 2679, DOI 10.17487/RFC2679, September 1999, <https://www.rfc-editor.org/info/rfc2679>.

[RFC2680] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way Packet Loss Metric for IPPM", RFC 2680, DOI 10.17487/RFC2680, September 1999, <https://www.rfc-editor.org/info/rfc2680>.

[RFC7799] Morton, A., "Active and Passive Metrics and Methods (with Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799, May 2016, <https://www.rfc-editor.org/info/rfc7799>.

[RFC8194] Schoenwaelder, J. and V. Bajpai, "A YANG Data Model for LMAP Measurement Agents", RFC 8194, DOI 10.17487/RFC8194, August 2017, <https://www.rfc-editor.org/info/rfc8194>.

[RFC8299] Wu, Q., Ed., Litkowski, S., Tomotaki, L., and K. Ogaki, "YANG Data Model for L3VPN Service Delivery", RFC 8299, DOI 10.17487/RFC8299, January 2018, <https://www.rfc-editor.org/info/rfc8299>.

[RFC8466] Wen, B., Fioccola, G., Ed., Xie, C., and L. Jalil, "A YANG Data Model for Layer 2 Virtual Private Network (L2VPN) Service Delivery", RFC 8466, DOI 10.17487/RFC8466, October 2018, <https://www.rfc-editor.org/info/rfc8466>.

[RFC8639] Voit, E., Clemm, A., Gonzalez Prieto, A., Nilsen-Nygaard, E., and A. Tripathy, "Subscription to YANG Notifications", RFC 8639, DOI 10.17487/RFC8639, September 2019, <https://www.rfc-editor.org/info/rfc8639>.

[RFC8641] Clemm, A. and E. Voit, "Subscription to YANG Notifications for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641, September 2019, <https://www.rfc-editor.org/info/rfc8641>.

[RFC8889] Fioccola, G., Ed., Cociglio, M., Sapio, A., and R. Sisto, "Multipoint Alternate-Marking Method for Passive and Hybrid Performance Monitoring", RFC 8889, DOI 10.17487/RFC8889, August 2020, <https://www.rfc-editor.org/info/rfc8889>.

[RFC8969] Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and L. Geng, "A Framework for Automating Service and Network Management with YANG", RFC 8969, DOI 10.17487/RFC8969, January 2021, <https://www.rfc-editor.org/info/rfc8969>.

[I-D.ietf-netconf-adaptive-subscription] Wu, Q., Song, W., Liu, P., Ma, Q., Wang, W., and Z. Niu, "Adaptive Subscription to YANG Notification", Work in Progress, Internet-Draft, draft-ietf-netconf-adaptive-subscription-00, 23 June 2022, <https://www.ietf.org/archive/id/draft-ietf-netconf-adaptive-subscription-00.txt>.

[I-D.ietf-netmod-eca-policy] Wu, Q., Bryskin, I., Birkholz, H., Liu, X., and B. Claise, "A YANG Data model for ECA Policy Management", Work in Progress, Internet-Draft, draft-ietf-netmod-eca-policy-01, 19 February 2021, <https://www.ietf.org/archive/id/draft-ietf-netmod-eca-policy-01.txt>.

[I-D.irtf-nmrg-ibn-concepts-definitions] Clemm, A., Ciavaglia, L., Granville, L. Z., and J. Tantsura, "Intent-Based Networking - Concepts and Definitions", Work in Progress, Internet-Draft, draft-irtf-nmrg-ibn-concepts-definitions-09, 24 March 2022, <https://www.ietf.org/archive/id/draft-irtf-nmrg-ibn-concepts-definitions-09.txt>.

[I-D.irtf-nmrg-ibn-intent-classification] Li, C., Havel, O., Olariu, A., Martinez-Julia, P., Nobre, J. C., and D. R. Lopez, "Intent Classification", Work in Progress, Internet-Draft, draft-irtf-nmrg-ibn-intent-classification-08, 18 May 2022, <https://www.ietf.org/archive/id/draft-irtf-nmrg-ibn-intent-classification-08.txt>.

Authors’ Addresses

Danyang Chen
China Mobile
Beijing
100053
China
Email: chendanyang@chinamobile.com

Kehan Yao
China Mobile
Beijing
100053
China
Email: yaokehan@chinamobile.com

Giuseppe Fioccola
Huawei
Riesstrasse, 25
80992 Munich
Germany
Email: giuseppe.fioccola@huawei.com

Qin Wu
Huawei
101 Software Avenue, Yuhua District
Nanjing
210012
China
Email: bill.wu@huawei.com

Internet Research Task Force H. Yang

Internet-Draft D. Chen

Intended status: Informational China Mobile

Expires: 2 January 2023 1 July 2022

One-way delay measurement method based on Digital Twin Network

draft-yc-nmrg-dtn-owd-measurement-00

Abstract

This document describes an accurate network delay measurement method based on the digital twin network. The method does not need to send measurement packets, change the physical network configuration, or change the format of service packets, and it does not require physical network elements to support a time synchronization protocol. It supports two-way and one-way delay measurement of any service packet. The digital twin network architecture in this document follows the NMRG draft draft-irtf-nmrg-network-digital-twin-arch-00.

Status of This Memo

Copyright Notice

Yang & Chen Expires 2 January 2023 [Page 1]

Internet-Draft Digital Twin Network July 2022

Table of Contents

1. Introduction
2. Conventions Used in This Document
  2.1. Terminology
  2.2. Requirements Language
3. Method Introduction
4. Implementation Process
5. Conclusion
6. Security Considerations
7. IANA Considerations
8. Normative References
Authors’ Addresses

1. Introduction

A digital twin network (DTN) is a virtual representation of the physical network. Such a virtual representation of the network is meant to be used to analyze, diagnose, emulate, and then control the physical network based on data, models, and interfaces. The DTN architecture diagram is shown in Figure 1.

Figure 1: Reference Architecture of Digital Twin Network

The digital twin layer forms a network element model by modeling physical network elements, and the network element model forms a twin network element through instantiation, that is, each physical network element in the physical network has a corresponding twin network element in the digital twin layer. Similarly, each physical flow of the physical network also has a corresponding twin flow at the digital twin layer.

Traditional network delay measurement methods include active measurement, passive measurement, hybrid measurement, etc., but they all have some disadvantages:

1) Measurement packets must be injected into the physical network, which affects the forwarding behavior of actual service traffic, reduces the accuracy of the delay measurement, increases the network burden, and occupies network resources;

2) It is impossible to perform accurate delay measurement on the packets of all network protocols. For example, it is difficult to measure the one-way delay of UDP packets;

3) Some solutions need to change the format of service packets and insert measurement parameters. This requires upgrading the physical network, which is difficult to implement, and it affects the normal forwarding behavior of service packets and hence the measurement accuracy;

4) A time synchronization protocol is required to measure the one-way delay of the network, and the physical network is required to support this protocol, which increases the difficulty of implementing the solution.

2. Conventions Used in This Document

2.1. Terminology

NTP Network Time Protocol

PTP Precision Time Protocol

DTN Digital Twin Network

2.2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Method Introduction

The delay measurement method based on DTN is as follows:

1) According to the digital twin network architecture, build a digital twin layer, including twin network elements corresponding to physical network elements, such as twin switches, twin routers, etc.;

2) Time synchronization is maintained between the twin network elements (NEs) in the digital twin layer. a) If multiple twin NEs are hosted on the same physical entity, as in an NFV-based modeling method where multiple twin NEs are deployed on one server and share the same local clock, the twin NEs are inherently time-synchronized; b) If multiple twin NEs are deployed on different physical entities, PTP (Precision Time Protocol) [IEEE.1588.2008] or NTP (Network Time Protocol) [RFC5905] is used to achieve time synchronization between the physical entities, ensuring time synchronization of all twin NEs;

3) Data transmission from the physical network layer to the digital twin layer uses a delay deterministic network (DetNet) to ensure that the data transmission delay between each physical network element and its twin network element is deterministic or pre-calculable, as shown in Figure 2. T1 to Tn denote the data transmission delays; the delay deterministic network can be based on TSN or DIP technology;

4) Consider a flow of the physical network that enters at physical network element 1, passes through physical network elements 2 and 3, and finally leaves at physical network element n. When physical network element 1 receives a data packet, it forwards the packet to physical network element 2 as usual and, at the same time, transmits the data to twin network element 1. If the local time of twin NE 1 at that moment is t1 and the deterministic network transmission delay is T1, the arrival time of the traffic information recorded by the twin NE is t1-T1; similarly, the arrival time of the data packet recorded by twin NE n is tn-Tn;

5) Finally, according to the arrival time of the data packet at the twin network elements, its one-way transmission delay between physical network elements can be calculated.

  +--------------------------------------------------------------+
  | Digital Twin Network            +----------+                 |
  |                       +---------+ Twin NE 3+----------+      |
  |                       |         +----------+          |      |
  |                       |                               |      |
  | +----------+    +-----+----+    +----------+    +-----+----+ |
  | | Twin NE 1+----+ Twin NE 2+----+ Twin NE 4+----+ Twin NE n| |
  | +----------+    +----------+    +----------+    +----------+ |
  +--------------------------------------------------------------+
                                  |
  +-------------------------------+------------------------------+
  |                Delay Deterministic Networking                |
  +-------------------------------^------------------------------+
                                  |
+---------------------------------+---------------------------------+
| Physical Network                  +------------+                  |
|                        +----------+Physical NE3+----------+       |
|                        |          +------------+          |       |
|                        |                                  |       |
| +------------+   +-----+------+   +------------+   +------+-----+ |
| |Physical NE1+---+Physical NE2+---+Physical NE4+---+Physical NEn| |
| +------------+   +------------+   +------------+   +------------+ |
+-------------------------------------------------------------------+

Figure 2: Between the physical network and the twin network is a delay deterministic network

4. Implementation Process

The detailed calculation process is shown in Figure 3:

(1) When the traffic data to be measured reaches physical network element 1, physical network element 1 forwards the traffic to physical network element 2 and also transmits the data to twin network element 1, with transmission delay T1. If the local time of twin network element 1 on receipt is t1, twin network element 1 records the arrival time of the data as t1-T1;

(2) When the data packet reaches physical network element 2, physical network element 2 forwards it to physical network element 3 as usual and also transmits it to twin network element 2, with transmission delay T2. If the local time of twin network element 2 on receipt is t2, twin network element 2 records the arrival time of the data packet as t2-T2. Then (t2-T2)-(t1-T1) is the one-way delay of the data packet from physical network element 1 to physical network element 2;

(3) Similarly, when the data packet reaches the nth physical network element, that element also transmits the data packet to twin network element n, with transmission delay Tn. If the local time of twin network element n on receipt is tn, twin network element n records tn-Tn as the arrival time of the data packet, and (tn-Tn)-(t1-T1) is the one-way transmission delay of the data packet from physical network element 1 to physical network element n.

In this way, the one-way transmission delay of data packets between physical NEs is obtained from the times at which the data packet under test reaches the twin NEs. The measurement requires time synchronization only between twin NEs; no time synchronization between physical network elements is required. The accuracy of delay measurement depends on the time synchronization accuracy of the twin network elements and the time synchronization accuracy of the delay deterministic network. If both use the PTP synchronization protocol, the delay measurement accuracy can reach the nanosecond level.
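The calculation in steps (1) to (3) can be illustrated with a minimal sketch. It is not part of the draft: the class and function names are hypothetical, timestamps are assumed to be integer nanoseconds, and the transfer delays T1..Tn are assumed to be known in advance from the delay deterministic network.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TwinObservation:
    """One sighting of the measured packet, as seen by twin NE i (illustrative)."""
    twin_receive_ns: int    # t_i: local clock of twin NE i on receipt, in ns
    transfer_delay_ns: int  # T_i: fixed physical-NE-to-twin-NE delay, in ns

    @property
    def physical_arrival_ns(self) -> int:
        # t_i - T_i: inferred time at which the packet reached physical NE i
        return self.twin_receive_ns - self.transfer_delay_ns


def one_way_delay_ns(first: TwinObservation, last: TwinObservation) -> int:
    """(t_n - T_n) - (t_1 - T_1): delay from the first to the last physical NE."""
    return last.physical_arrival_ns - first.physical_arrival_ns


def segment_delays_ns(path: list[TwinObservation]) -> list[int]:
    """Hop-by-hop delays between consecutive physical NEs along the path."""
    return [one_way_delay_ns(a, b) for a, b in zip(path, path[1:])]


# Made-up sample values for a three-NE path.
path = [
    TwinObservation(twin_receive_ns=1_000_500, transfer_delay_ns=200),
    TwinObservation(twin_receive_ns=1_000_900, transfer_delay_ns=350),
    TwinObservation(twin_receive_ns=1_001_400, transfer_delay_ns=100),
]
print(segment_delays_ns(path))              # segment-by-segment delays
print(one_way_delay_ns(path[0], path[-1]))  # end-to-end delay
```

Only the twin-side clocks appear in the arithmetic, which is why the physical NEs need no synchronization of their own in this sketch.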

Figure 3: Delay Measurement Process

5. Conclusion

This method can realize segment-by-segment or end-to-end one-way delay measurement in the physical network. Its advantages include: no measurement packets need to be sent, traffic of all protocol types can be measured, the physical network configuration is not changed, the traffic data format is not changed, the physical network does not need to support a time synchronization protocol, and the measurement accuracy is high.

8. Normative References

[IEEE.1588.2008] IEEE, "IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems", July 2008.

[RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, "Network Time Protocol Version 4: Protocol and Algorithms Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, <https://www.rfc-editor.org/info/rfc5905>.

[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.

Danyang Chen China Mobile Beijing 100053 China Email: chendanyang@chinamobile.com

Internet Research Task Force H. Yang

Internet-Draft C. Zhou

Intended status: Informational China Mobile

Expires: 2 January 2023 1 July 2022

Digital Twin Network Flow Simulation

draft-yz-nmrg-dtn-flow-simulation-00

Abstract

Some important application scenarios of the digital twin network, such as experiments with new network technologies, network configuration verification, and network performance optimization, require the virtual traffic in the twin network to accurately simulate the real traffic in the physical network. The real traffic in the physical network is called the physical traffic, and the virtual traffic in the twin network is called the twin traffic. To achieve high-fidelity simulation of the physical traffic by the twin traffic, this document proposes three consistency characteristics that the twin traffic and the physical traffic should satisfy, and introduces an implementation method for twin flow.

Status of This Memo

Copyright Notice

This document is subject to BCP 78 and the IETF Trust’s Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.

Yang & Zhou Expires 2 January 2023 [Page 1]

Table of Contents

1. Introduction
2. Conventions Used in This Document
  2.1. Terminology
  2.2. Requirements Language
3. Key characteristics of DTN flow
4. DTN flow implementation method
5. Conclusion
6. Security Considerations
7. IANA Considerations
8. Normative References
Authors’ Addresses

1. Introduction

A digital twin network is a virtual representation of the physical network. Such a virtual representation of the network is meant to be used to analyze, diagnose, emulate, and then control the physical network based on data, models, and interfaces. The DTN architecture diagram is shown in Figure 1.

Figure 1: Reference Architecture of Digital Twin Network

The digital twin layer forms a network element model by modeling physical network elements, and the network element model forms a twin network element through instantiation, that is, each physical network element in the physical network has a corresponding twin network element in the digital twin layer. Similarly, each physical flow of the physical network also has a corresponding twin flow at the digital twin layer.

Through the real-time data interaction between the physical network and the twin network, the physical network elements, network topology, network traffic, network status, and other data in the physical network are virtualized at the twin network layer. The topologies of the physical network and the twin network are consistent, the number of NEs is the same, and the traffic information is the same.

2. Conventions Used in This Document

2.1. Terminology

DTN Digital Twin Network

2.2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Key characteristics of DTN flow

The twin network layer needs to accurately simulate the traffic of the physical network to support the normal operation of the network application layer. The twin traffic of the twin network layer and the physical traffic of the physical network need to satisfy the following three characteristics simultaneously.

1) The forwarding paths of the two are consistent; that is, the twin nodes that the twin traffic passes through at the twin network layer correspond to the physical nodes that the physical traffic passes through at the physical network layer;

2) The network performance of the two is consistent; that is, the twin traffic and the physical traffic show the same performance in terms of network delay, packet loss, and jitter;

3) The data characteristics of the two are consistent; that is, the data packets of the twin traffic and the physical traffic share the same key characteristics, such as traffic rate, 5-tuple information, packet length, and packet priority.

4. DTN flow implementation method

If the twin flow and physical flow are to meet the above three characteristics, three problems need to be solved:

1) The physical network elements and the twin network elements need identifiers that are unique across the entire network, so that each physical network element can be mapped to its twin. Whichever physical network elements the physical traffic passes through, the twin traffic then passes through the corresponding twin network elements, which yields the same forwarding path;

2) The physical flow is collected and managed centrally by the Data Repository of the twin network layer and then distributed to each twin network element. Because the time each physical network element needs to complete data collection and data transmission differs, the twin flow must lag the physical flow by a fixed time in order for the twin flow and the physical flow to show the same forwarding delay, packet loss, and jitter;

3) The flow data collected by the Data Repository should include the key information of the physical flow, so that the data characteristics of the twin flow and the physical flow are consistent. The Data Repository can collect the physical flow in full, packet by packet, or partially at a given sampling rate.

  +--------------------------------------------------------------+
  | Digital Twin Network            +----------+                 |
  |                       +---------+ Twin NE 3+----------+      |
  |                       |         +----------+          |      |
  |                       |                               |      |
  | +----------+    +-----+----+    +----------+    +-----+----+ |
  | | Twin NE 1+----+ Twin NE 2+----+ Twin NE 4+----+ Twin NE n| |
  | +----------+    +----------+    +----------+    +----------+ |
  |                      +-----------------+                     |
  |                      | Data Repository |                     |
  |                      +-----------------+                     |
  +--------------------------------------------------------------+
                                  |
  +-------------------------------+------------------------------+
  |                Delay Deterministic Networking                |
  +-------------------------------^------------------------------+
                                  |
+---------------------------------+---------------------------------+
| Physical Network                  +------------+                  |
|                        +----------+Physical NE3+----------+       |
|                        |          +------------+          |       |
|                        |                                  |       |
| +------------+   +-----+------+   +------------+   +------+-----+ |
| |Physical NE1+---+Physical NE2+---+Physical NE4+---+Physical NEn| |
| +------------+   +------------+   +------------+   +------------+ |
+-------------------------------------------------------------------+

Figure 2: Twin Flow and Physical Flow

The three problems above can be solved with the following three methods:

1) Each physical network element has a system MAC address. Because the MAC address is unique in the whole network, it can serve as the identifier of the physical network element. The twin NE ID can be derived by extending the physical NE ID: for example, an 8-bit custom field, which could identify the device type, is appended to the system MAC address of the physical NE. Identifying the twin NE based on the MAC address of the physical NE both establishes the one-to-one correspondence between the physical NE and the twin NE and gives the twin NE a unique identifier in the entire network.

2) The data transmission network between the physical network elements and the Data Repository is a delay deterministic network, such as TSN (Time Sensitive Network), DIP (Deterministic Internet Network), etc. The delays with which different physical network elements transmit data to the Data Repository may differ, but in a delay deterministic network the data transmission delays T1 to Tn are fixed and can be pre-calculated. After the Data Repository calculates T1 to Tn, it selects the maximum value Tmax as the reference time. Assume that the data collected from the physical network elements arrives at the Data Repository at times t1 to tn. If the data transmission time Ti of network element i is less than Tmax, the Data Repository waits for (Tmax-Ti) before transmitting the data to the twin network element. If Ti = Tmax, then Tmax-Ti = 0 and the Data Repository transmits the data to the twin network element immediately. Because the Data Repository and the twin network elements are deployed in the same local area network or the same physical entity (such as a server), the transmission delay between the Data Repository and each twin network element can be ignored. As a result, all twin flow lags the physical flow by the fixed time Tmax, while the forwarding delay, jitter, packet loss, and other performance characteristics of the two are the same.

3) The data collected by the Data Repository needs to contain the key information of the physical flow, such as physical network element MAC address, traffic sampling rate, source MAC, destination MAC, protocol type, source IP address, destination IP address, protocol number, source port number, destination port number, packet priority, packet length, packet forwarding delay, etc. The first two parameters are mandatory, and the remaining fields are optional according to application requirements.
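Methods 1) and 2) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the twin NE ID is rendered as the MAC address followed by one extra octet (the draft does not fix an encoding), and delays are integer microseconds.

```python
from __future__ import annotations

import string


def twin_ne_id(mac: str, custom_field: int) -> str:
    """Append an 8-bit custom field (e.g. a device type code) to a system MAC."""
    octets = mac.replace("-", ":").split(":")
    valid = len(octets) == 6 and all(
        len(o) == 2 and all(c in string.hexdigits for c in o) for o in octets
    )
    if not valid:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    if not 0 <= custom_field <= 0xFF:
        raise ValueError("custom field must fit in 8 bits")
    return ":".join(o.lower() for o in octets) + f":{custom_field:02x}"


def hold_times(transfer_delays: dict[str, int]) -> dict[str, int]:
    """Per-NE wait (Tmax - Ti) so that every NE's data lags by exactly Tmax."""
    t_max = max(transfer_delays.values())
    return {ne: t_max - t_i for ne, t_i in transfer_delays.items()}


# Made-up sample values.
print(twin_ne_id("00:1A:2B:3C:4D:5E", 0x03))
print(hold_times({"ne1": 120, "ne2": 300, "ne3": 210}))
```

The element with the largest delay is released immediately, and every other element is held back just long enough that transfer delay plus hold time equals Tmax.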

The implementation steps of twin flow are as follows, as shown in Figure 3:

(1) Build the digital twin network so that the physical network elements and the twin network elements are in one-to-one correspondence through network-wide unique identifiers, and the number of network elements and the topology are consistent;

(2) Each physical network element forms a data set of key flow information, such as {network element identification, sampling rate, source MAC, destination MAC, protocol type, source IP address, destination IP address};

(3) The Data Repository collects the data sets of each physical network element and calculates the maximum delay Tmax of data transmission;

(4) After the Data Repository collects a data set, it sends the data set to the corresponding twin network element according to the physical network element identifier;

(5) Twin network elements generate twin flow according to the sampling rate and flow information of the data set. Because the data transmission delay between the physical network elements and the Data Repository is fixed at Tmax, all the flow of the twin network is delayed by Tmax relative to the physical flow. Because the Data Repository and the twin network elements are in the same server or local area network, the transmission delay between them is negligible.
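Steps (2) and (4) can be sketched as a small record type plus a dispatch function. The sketch is illustrative only: the names are hypothetical, the sampling rate is assumed to be a fraction in (0, 1], and the addresses are documentation values.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FlowRecord:
    ne_id: str            # mandatory: identifier (system MAC) of the physical NE
    sampling_rate: float  # mandatory: assumed here to be a fraction in (0, 1]
    fields: dict[str, str] = field(default_factory=dict)  # optional flow fields

    def __post_init__(self) -> None:
        if not self.ne_id:
            raise ValueError("ne_id is mandatory")
        if not 0 < self.sampling_rate <= 1:
            raise ValueError("sampling_rate must be in (0, 1]")


def dispatch(records: list[FlowRecord]) -> dict[str, list[FlowRecord]]:
    """Group collected records by NE identifier, one bucket per twin NE."""
    per_twin: dict[str, list[FlowRecord]] = defaultdict(list)
    for record in records:
        per_twin[record.ne_id].append(record)
    return dict(per_twin)


records = [
    FlowRecord("00:1a:2b:3c:4d:5e", 1.0, {"src_ip": "192.0.2.1", "dst_ip": "198.51.100.7"}),
    FlowRecord("00:1a:2b:3c:4d:5f", 0.1),
    FlowRecord("00:1a:2b:3c:4d:5e", 1.0, {"src_ip": "192.0.2.9", "dst_ip": "198.51.100.7"}),
]
for ne_id, batch in dispatch(records).items():
    print(ne_id, len(batch))
```

Keeping the optional fields in a plain mapping mirrors the rule that only the first two parameters are mandatory.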

Figure 3: The generation process of twin traffic

5. Conclusion

This document describes a method for high-fidelity simulation of DTN twin flow, such that twin flow and physical flow meet the following three characteristics:

1) The forwarding paths of the two types of flow are the same; that is, the twin flow passes through the twin nodes that correspond to the physical nodes the physical flow passes through;

2) The network performance of the two types of flow is the same; that is, the two show the same performance in terms of network delay, packet loss, and jitter;

3) The data characteristics of the two types of flow are consistent; that is, they share the same key characteristics, such as flow rate, 5-tuple information, packet length, and packet priority.

8. Normative References

[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.

Internet Research Task Force C. Zhou

Internet-Draft D. Chen

Intended status: Informational China Mobile

Expires: 11 January 2023 P. Martinez-Julia, Ed.

NICT

10 July 2022

Data Collection Requirements and Technologies for Digital Twin Network

draft-zcz-nmrg-digitaltwin-data-collection-00

Abstract

The Digital Twin Network is a network system consisting of a Physical Network and a Twin Network, which can be mapped to each other interactively in real time. The construction of a Digital Twin Network requires real-time data from the Physical Network to update the state of the Twin Network. This document describes the data collection requirements and provides data collection methods or tools to build the data repository for a digital twin network.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

Copyright Notice

Table of Contents

1. Introduction
2. Definitions and Acronyms
3. Data Collection Requirements for Digital Twin Network
  3.1. Target Driven and On-demand Collection
  3.2. Diverse Tools for Various Data
  3.3. Lightweight and Efficient Collection
  3.4. Open and Standardized Interfaces
  3.5. Naming for Caching
  3.6. Efficient Multi-Destination Delivery
4. An Efficient Data Collection Method for Digital Twin Network
  4.1. Overview
  4.2. Efficient Data Collection Mechanism
  4.3. Data Collection Process
5. Summary
6. Security Considerations
7. IANA Considerations
8. References
  8.1. Normative References
  8.2. Informative References
Authors’ Addresses

1. Introduction

With the deployment of the Internet of Things (IoT), cloud computing, data centers, and so on, the scale of current networks is gradually expanding. However, the growth in scale also increases network complexity and induces many problems. In order to improve the autonomy of the network and reduce potential negative effects on physical and virtual networks, we consider an endogenously intelligent and autonomous network architecture that achieves self-optimization and self-decision (in general, self-management and self-operation) to be indispensable. Digital twin technology answers the challenge of building self-management systems because it can optimize and validate policies through real-time and interactive mapping with physical entities [I-D.irtf-nmrg-network-digital-twin-arch].

Data is the cornerstone required for constructing a digital twin for a network, namely a Digital Twin Network (DTN). At large network scale, data collection, storage, and management face great challenges. Data collection methods and tools should therefore be target-driven, diverse, lightweight, and efficient, while being open and standardized. Among all the requirements, achieving a lightweight and efficient data collection method is the most important. If a full-data collection method is adopted, huge storage space and bandwidth resources are needed, especially for complex scenarios that require real-time data and traffic from multi-source and heterogeneous devices. Therefore, it is extremely important to agree on lightweight and efficient data collection, aggregation, and correlation methods for the telemetry data transmission, processing, and storage required to build a DTN system.

2. Definitions and Acronyms

PN: Physical Network

IMC: Instruction Management Center

DSC: Data Storage Center

DTN: Digital Twin Network

TSE: Telemetry Streaming Element

RDF: Resource Description Framework

CEP: Complex Event Processing

3. Data Collection Requirements for Digital Twin Network

3.1. Target Driven and On-demand Collection

The monitoring data of a network is the basis to build a DTN system. Such data is collected from physical and virtual networks. It includes, but is not limited to, the following types:

* Provisional and operational status of physical or virtual devices, as well as the network topology with all network elements.

* Running status of physical, logical, or virtual ports and links.

* Log and event records of all the network elements.

* Statistics (packet loss, traffic throughput, latency, etc.) of flows and ports.

* Various data regarding users and services.

* Life-cycle operation data of all network elements.

* All above data in time series.

The collection of network data for maintaining a DTN should be target-driven and on-demand. It is not always necessary to collect the complete set of network data listed above because of the high cost in resources (CPU, memory, bandwidth, etc.). The type, frequency, and method of data collection needed to serve the applications of a DTN depend on the specific network topology and application requirements.

3.2. Diverse Tools for Various Data

The different types of network data used to maintain a DTN have different characteristics. Some data (e.g., port statistics, key link information) require a higher collection frequency, and some data (e.g., flow status, link faults) need to be closer to real time. Some data (e.g., device status, port statistics) can be collected directly and simply with ordinary tools, while other data (e.g., per-flow latency, traffic matrix) can only be acquired through complex network measurement. Therefore, multiple tools or methods are needed to collect the massive data required to build the DTN entity.

Currently, widely used tools such as SNMP, NETCONF, Telemetry, INT (In-band Network Telemetry), and DPI (Deep Packet Inspection) are candidates for collecting data for a digital twin network. Going forward, new data collection technology needs to be studied in the following aspects, in combination with the data requirements of network applications for the DTN:

* High-performance data collection technology based on programmable circuits.

* Measurement methods for complex network data such as network performance and network traffic.

* Collaborative data collection technology for multiple data sources.

* Distributed and collaborative data collection technology for complex network, and the time synchronization problem of data acquisition.

3.3. Lightweight and Efficient Collection

Data collection tools and methods should be as lightweight as possible, so as to reduce the occupation of network equipment resources and ensure that data collection does not affect the normal operation of the network. The major requirements are listed below.

* Data collection tools and methods need to improve execution efficiency and reduce the cost of computing, storage, and communication bandwidth.

* The collection of redundant data should be avoided or minimized.

* For the data set that needs to be collected, make full use of the data compression technology, to reduce the resource cost in the collection phase.

3.4. Open and Standardized Interfaces

The data collection interfaces used to build the DTN should be open and standardized to help avoid hardware or software vendor lock-in and to achieve interoperability. The major requirements of data collection interfaces are:

* Support configuration management, including the data collection protocol, frequency or period, etc.

* Support several speed options (e.g. minute-level, 10-second level, second level (near real time), and real time level) to accommodate different data requirements from applications.

* Be extensible so that more features can be added with limited parameter changes and with backward compatibility.

* Provide a secure and reliable information exchange mechanism.

3.5. Naming for Caching

Both raw network data and knowledge items obtained from monitoring must be uniquely addressable. This means giving each data or knowledge item a unique identifier, or "name", that references it. Caching mechanisms use this name to store the data and to provide it to clients, which request it by the same name.
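A minimal sketch of name-based caching follows. The draft does not define a naming scheme, so the hierarchical name built from source, data type, and a content digest is purely an assumption for illustration, as are the class and function names.

```python
from __future__ import annotations

import hashlib


def item_name(source: str, data_type: str, payload: bytes) -> str:
    """Build a name for an item (hypothetical scheme: /source/type/digest)."""
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"/{source}/{data_type}/{digest}"


class NamedCache:
    """Stores data or knowledge items under their names."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def put(self, source: str, data_type: str, payload: bytes) -> str:
        name = item_name(source, data_type, payload)
        self._items[name] = payload
        return name

    def get(self, name: str) -> bytes | None:
        # Clients request an item by the same name that was used to store it.
        return self._items.get(name)


cache = NamedCache()
name = cache.put("ne1", "port-stats", b"rx=1024;tx=2048")
print(name)
print(cache.get(name))
```

Deriving part of the name from the content means identical items map to the same name, so a cache stores them once; other schemes (for example, names based on time windows) would serve equally well.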

3.6. Efficient Multi-Destination Delivery

The maintenance of DTN systems will not be the sole purpose of communicating monitoring information and knowledge. Other applications may also request raw telemetry data or knowledge items, using the name to identify them. The telemetry system, following the recommendations of RFC 9232 [RFC9232], will deliver the requested data or knowledge items to the requesters as efficiently as possible. On the one hand, items will be provided by the cache closest to the destination of the data. On the other hand, items will be replicated in the best nodes, following an efficient multicast spanning tree. Different underlying protocols can be used to achieve this mechanism.

4. An Efficient Data Collection Method for Digital Twin Network

4.1. Overview

The system that manages the DTN maps, in real time, the PN to the DTN. However, the existing methods collect the full data from the PN for modeling and do not consider problems such as time lag, insufficient storage resources, low computational efficiency, and waste of bandwidth resources caused by data transmission. In order to solve these problems, this section introduces an efficient data collection method for maintaining the DTN. This data collection method is based on sending instructions to the elements of the PN so that they pre-process the data (data cleaning or knowledge representation) before sending it back to be applied to the DTN.

4.2. Efficient Data Collection Mechanism

The management system structure consists of the PN and the DTN. The PN includes multiple Data Storage Centers (DSC) and the Telemetry Streaming Element (TSE), and the DTN includes the Instruction Management Center (IMC) and a Data Storage Center (DSC). The TSE has multiple functions, including data collection, data aggregation, data correlation, and knowledge representation and query. In addition, a Complex Event Processing (CEP) engine is integrated into the TSE to perform queries on the streamed data.

The IMC has two functions. On the one hand, it manages the registration of the DSC on the PN side; the registration information can include key information such as the IP address of the DSC on the PN side, the chosen data type, the index names in the data, the data source name, and the data size. On the other hand, it adaptively configures data collection instructions according to the collection requirements of the DSC on the DTN side and looks up the IP addresses to which the instructions are sent. The information carried by an instruction includes rule-based mathematical expressions, executable models in .exe format, dynamic collection frequency, parameter lists, program text files in .m format, text files with parameter configuration, and other types of files. Instructions are flexible and programmable, and can be created, modified, combined, and deleted at any time according to requirements.

When the DSC on the DTN side requests data from the IMC, the IMC searches for the IP address of the DSC in the database of registration information, which is built according to critical information such as data type and data name. Functional instructions for data processing or knowledge representation can then be configured according to the demand. The DSC on the DTN side stores the effective information returned by the TSE after data processing and knowledge representation.

The DSC on the PN side has two functions. On the one hand, it stores data of various types, such as performance indicators, operational status, logs, traffic scheduling, and business requirements. On the other hand, it automatically parses the instructions sent by the TSE. The operating environment of an instruction is then configured according to the needs of the instruction, and data processing or knowledge representation is performed based on the instruction. Data processing mainly includes data cleaning, filling in missing data, normalization, and conflict verification. Knowledge representation refers to representing the original data as a data structure that can be used for efficient computation. Such representations are closer to machine language, which is conducive to the rapid and accurate construction of the model.
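Two of the data processing operations named above, filling in missing data and normalization, can be sketched as follows. The draft does not prescribe particular algorithms; forward-filling and min-max scaling are assumptions chosen for illustration, and the function names are hypothetical.

```python
from __future__ import annotations


def fill_missing(samples: list[float | None]) -> list[float]:
    """Replace gaps (None) with the most recent valid sample (forward fill).

    Leading gaps take the first valid sample.
    """
    known = [v for v in samples if v is not None]
    if not known:
        raise ValueError("no valid samples to fill from")
    filled: list[float] = []
    last = known[0]
    for value in samples:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def min_max_normalize(samples: list[float]) -> list[float]:
    """Scale samples linearly into [0, 1]; a constant series maps to zeros."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return [0.0 for _ in samples]
    return [(v - lo) / (hi - lo) for v in samples]


raw = [None, 10.0, None, 30.0]  # made-up samples with gaps
clean = fill_missing(raw)
print(clean)
print(min_max_normalize(clean))
```

Running such steps on the PN side means only the cleaned, compact series has to travel to the DTN side, which is the bandwidth saving this section argues for.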

+------------------------------+    +-------------------------+
|       Physical Network       |    |  Digital Twin Network   |
| +-----+    +-----+  +------+ |    | +------+      +-------+ |
| |     |    |     |  |      | |    | |      |      |       | |
| | DSC |... | DSC |  | TSE  | |    | | IMC  |      |  DSC  | |
| |     |    |     |  |      | |    | |      |      |       | |
| +-+---+    +--+--+  +---+--+ |    | +---+--+      +----+--+ |
|   |           |         |    |    |     |              |    |
+------------------------------+    +-------------------------+
    |           |         |               |              |
    |   1.1. Register     |               |              |
    +-----------+--------->               |              |
    |           |         |               |              |
    |       1.2. Register |               |              |
    |           +--------->               |              |
    |           |         |               |              |
    |           |         | 1.3. Register |              |
    |           |         +--------------->              |
    |           |         |               |              |
    |           |         |               | 2. Data req. |
    |           |         |               <--------------+
    |           |         |               |              |
    |           |         | 3. Query and  |              |
    |           |         | instruction   |              |
    |           |         | configuration |              |
    |           |         |               +              |
    |           |         |               |              |
    |           |         | 4. Send       |              |
    |           |         | instructions  |              |
    |           |         <---------------+              |
    |           |         |               |              |
    |       5. Parse and  |               |              |
    | execute instruction |               |              |
    |           |         |               |              |
    | 6. Data subscript.  |               |              |
    <---------------------+               |              |
    |           |         |               |              |
    | 7. Knowledge        |               |              |
    | representation      |               |              |
    |           |         |               |              |
    | 8. Data pushing     |               |              |
    +--------------------->               |              |
    |           |         |               |              |
    | 9. Data aggregation |               |              |
    | and correlation     |               |              |
    |           |         |               |              |
    |           |         | 10. Send processed data      |
    |           |         +------------------------------>
    |           |         |               |              |

Figure 1: Data Collection Process

4.3. Data Collection Process

The specific process is as follows:

* The DSC on the PN side registers with the TSE, and the TSE registers with the IMC. Both provide their IP addresses, the data type, the data source, the data size, etc.

* The DSC on the DTN side sends the data collection request to the IMC.

* According to the data collection request, the IMC intelligently queries the registration addressing information and configures the data processing instruction.

* The IMC sends the corresponding instruction to the TSE according to the query result.

* After receiving the instructions, the TSE parses and executes them. The query function can be performed by the CEP engine, which receives all telemetry data and processes it with all queries provided.

* The TSE sends a data subscription to the DSC on the PN side.

* The DSC on the PN side either represents the data semantically in RDF form or sends the data in raw form to the TSE, which then produces the semantic representation.

* The DSC on the PN side pushes the data or knowledge item to the TSE.

* The TSE aggregates and correlates the collected data or knowledge items and, according to the actual needs, generates aggregated data or knowledge items.

* The TSE sends the resulting data or knowledge items to the DSC on the DTN side.
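The registration, request, and instruction steps at the start of this process can be sketched as a small in-memory registry. This is a simplified illustration, not the draft's design: the class names, field names, and operation strings are hypothetical, instructions are reduced to a list of operation names, and the IP addresses are documentation values.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    ip: str           # address the IMC sends instructions to
    data_type: str    # kind of data the registrant can provide
    data_source: str  # name of the data source


@dataclass(frozen=True)
class Instruction:
    target_ip: str
    data_type: str
    operations: tuple[str, ...]  # e.g. cleaning or normalization steps


class InstructionManagementCenter:
    def __init__(self) -> None:
        self._registry: list[Registration] = []

    def register(self, registration: Registration) -> None:
        self._registry.append(registration)

    def handle_request(self, data_type: str, operations: list[str]) -> list[Instruction]:
        """Look up registrants for the requested data type and build instructions."""
        return [
            Instruction(reg.ip, data_type, tuple(operations))
            for reg in self._registry
            if reg.data_type == data_type
        ]


imc = InstructionManagementCenter()
imc.register(Registration("192.0.2.10", "port-stats", "tse-1"))
imc.register(Registration("192.0.2.11", "logs", "tse-2"))
for instruction in imc.handle_request("port-stats", ["clean", "normalize"]):
    print(instruction)
```

Because the request names a data type rather than an address, the requester on the DTN side never needs to know where the data lives; the registry lookup resolves that.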

5. Summary

This draft describes the requirements for data collection and provides the data collection methods or tools required to build the data repository for maintaining DTN systems. These data collection methods or tools should be target-driven, diverse, lightweight, and efficient, while being open and standardized. Among all the requirements, being lightweight and efficient is the most important. Thus, this draft provides a lightweight and efficient method for data collection that is particularly optimized for maintaining DTN systems. Going forward, more methods (transformation and aggregation functions) and tools (solutions) shall be studied to extend the contents of this draft.

8. References

[RFC9232] Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and A. Wang, "Network Telemetry Framework", RFC 9232, DOI 10.17487/RFC9232, May 2022, <https://www.rfc-editor.org/info/rfc9232>.

[I-D.irtf-nmrg-network-digital-twin-arch] Zhou, C., Yang, H., Duan, X., Lopez, D., Pastor, A., Wu, Q., Boucadair, M., and C. Jacquenet, "Digital Twin Network: Concepts and Reference Architecture", Work in Progress, Internet-Draft, draft-irtf-nmrg-network-digital-twin-arch-00, 21 March 2022, <https://www.ietf.org/archive/id/draft-irtf-nmrg-network-digital-twin-arch-00.txt>.

Danyang Chen China Mobile Beijing 100053 China

Email: chendanyang@chinamobile.com

Pedro Martinez-Julia (editor) NICT 4-2-1, Nukui-Kitamachi, Koganei, Tokyo 184-8795 Japan Email: pedro@nict.go.jp
