The Internet is the first truly multi-service network: voice, video and data services are all offered on the same network. Among the many consequences of this convergence, the change in traffic structure is probably one of the most prominent. Not only has the amount of traffic increased, but its nature has radically changed too, mainly due to the offer of new advanced services (whose socio-economic impact is incontestably important). The popularity of these services, such as file-sharing (e.g. KaZaa, eDonkey, BitTorrent), Internet telephony (e.g. Skype, Gtalk, VoIPbuster) and Internet television (e.g. Sopcast, Joost, Zatoo, Babelgum), has exploded in recent years. As a result, network data traffic is increasingly complex and dynamic.
These two characteristics translate into high traffic unpredictability (at least with traditional and simple techniques). Predicting traffic volume is very useful for both capacity planning and self-management schemes (even though they involve very different time scales). For instance, a typical application of this prediction is dynamic resource reservation in the context of Virtual Private Network (VPN) provisioning [1]. Another application, related to the emerging discipline of green networking [2], is the Adaptive Link Rate (ALR) problem. ALR is proposed as a way to reduce energy consumption by adjusting the configured link rate on the router to the minimum required. It is therefore interesting to design a prediction scheme that accurately predicts such variable traffic in an online manner, and at the same time minimizes the assumptions on the traffic structure (e.g. no assumption should be made on the statistical properties of the measurements).
However, the prediction of traffic features such as the mean or the maximum over a given time interval does not give information about the traffic itself. Certain traffic-aware applications require the actual identification of the type of traffic traversing the network, or even of the application generating it. For instance, traffic (application) classification is instrumental in applying QoS policies, in blocking specific banned applications (e.g. P2P or chat in an office), or in identifying popular applications for commercial purposes (e.g. advertising). An outstanding example of these new types of traffic is P2P-TV (Peer-to-Peer Television): large-scale real-time video-streaming services exploiting the peer-to-peer (P2P) communication paradigm. There are several currently deployed P2P-TV systems featuring low-quality streaming [3, 4, 5, 6], but high-quality systems will soon be in widespread use too [7, 8].
P2P-TV systems are candidates to become a source of serious problems for the Internet, as testified by the growing success of commercial systems such as PPLive, SopCast, TVAnts and many others. Indeed, P2P-TV traffic may potentially grow without control, causing a degradation of the quality of service perceived by Internet users, or even the collapse of the network [9]. While the downlink rate is limited by the stream rate, the uplink rate may grow unboundedly, as observed in [10]. Furthermore, the proprietary (and closed) design of some successful P2P-TV applications turns the identification of such applications into a topic of growing importance. For instance, an ISP (Internet Service Provider) will certainly be interested in blocking these resource-consuming applications, or at least in identifying them.
Although the traffic classification problem is not new, for different reasons some well-known techniques are becoming obsolete. Examples of these are port-based and payload-based classification. Whereas the former can be misled by, for instance, firewall blocking or dynamic port allocation, the computational complexity of the latter may be so high as to make it not implementable. As an alternative, very good performance is obtained by means of statistical classification [11, 12]. In addition, behavioural classification, based on the rationale that different applications generate different traffic patterns, has been used in the context of coarse-grained classification of Internet hosts [13, 14]. We therefore focus on the design of a classification engine targeted at P2P-TV applications (i.e. fine-grained classification) based on the behavioural paradigm.
Another characteristic of today's networks related to the above considerations is the use of new emerging technologies. Indeed, just as the traffic injected into the network is increasingly heterogeneous, so are the access technologies. In particular, wireless has increasingly become the edge technology of choice. It is expected that in the near future, with the proliferation of smart devices with new functionalities, communication between wireless terminals will usually take place without the core network being involved. These terminals will act in a self-organized and adaptive manner, falling into the category of Mobile Ad-hoc NETworks (MANETs).
When analyzing the design or the performance of MANETs, the shared nature of the medium appears as one of the most difficult aspects to cope with. As such, the design of the Medium Access Control (MAC) mechanism plays a key role. Ideally, a well-designed MAC algorithm allows the maximum number of simultaneous transmissions such that they do not interfere with each other, all while minimizing unfairness in access opportunities. However, such a design is a really difficult task when a decentralized mechanism is considered (which is the ideal case). The MAC mechanism is then run locally by each node, which has only local information about the network state. The most widely deployed MAC mechanism is CSMA/CA (Carrier Sense Multiple Access with Collision Avoidance), used for instance in IEEE 802.11 and IEEE 802.15.4. Even though it has been relatively well studied from an empirical point of view, there is no widely accepted mathematical model for it. In this sense, an analysis of the protocol and the identification of its relevant features for modeling purposes are, surprisingly, missing. For instance, current models do not take into account the randomness of channel conditions due to fading/shadowing effects [21, 22]. Furthermore, some aspects of CSMA/CA are simply ignored, such as the Clear Channel Assessment (CCA). In this context we analyze whether these models can be extended to the non-deterministic case, or whether it is necessary and possible to define a new model for CSMA/CA.
In any case, CSMA/CA does not guarantee anything in terms of the quality obtained by the accepted transmissions. This lack of QoS (Quality of Service) guarantees could hinder the implementation of, for instance, advanced real-time applications. It would then be interesting to design and analyze an access control mechanism with performance guarantees. As mentioned above, the design of such a mechanism must cope with the decentralized nature of MANETs and with their shared resources (where the QoS of each connection depends on all active connections).
This thesis is organized in two parts. The first one is dedicated to the design of tools based on Machine Learning, in particular Support Vector Machines (SVM) [15, 16], to cope with two of the problems previously mentioned: link load prediction and P2P-TV traffic classification. SVM are a set of methods for classification and regression that are grounded in the framework of statistical learning theory. Even though the first work on this topic dates from the seventies [17], it only gained the attention of the international academic community in the nineties. SVM have been used extensively, mainly in the context of pattern recognition, where they have shown very good performance. In the context of networking applications, they have been used, for instance, for anomaly detection and throughput estimation [18, 19]. We are interested in these techniques since they work well in many learning situations, thanks to their good generalization to unseen data. Moreover, they are amenable to continuous and adaptive online learning, which constitutes an extremely desirable property for our purposes.
As a first step, we perform a deep and comprehensive study of SVM techniques. We then analyze in detail their performance when applied to the above-mentioned networking problems. As a general remark, we can say that in both cases SVM proved to be a very useful and powerful tool. Whereas for link load prediction the main assets of SVM are its robustness and low computational complexity, for P2P-TV classification SVM provides impressively good results.
More precisely, for the link load prediction we follow the methodology known as the “embedding procedure”. A time series of averaged link load values at the chosen time scale is considered, and a future value of this time series is predicted based only on a small number of past observations. In particular, we choose small time scales for this work, i.e. less than one minute. An extensive sensitivity analysis is presented, which shows the robustness of SVM in terms of parameter tuning and training, as well as its cost-effectiveness. Its accuracy is assessed by means of a thorough comparison phase that includes parametric and non-parametric models, ranging from simple moving averages to more sophisticated models such as the Nadaraya-Watson estimator [23]. We found that in every case, SVM provides the best results.
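To fix ideas, the following minimal sketch illustrates the embedding procedure on synthetic data (it is not the code used in the thesis; scikit-learn, an RBF kernel and the parameter values are assumptions of the example): the input vector consists of the p most recent load samples, and the output is the next sample.

```python
# Minimal sketch of the "embedding procedure" with Support Vector
# Regression. The load series is synthetic, standing in for averaged
# link load measurements at a sub-minute time scale.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
load = 10 + 3 * np.sin(np.arange(2000) / 50) + rng.normal(0, 0.5, 2000)

p = 10  # embedding dimension: number of past samples used as input
X = np.array([load[t - p:t] for t in range(p, len(load))])
y = load[p:]

split = 1500  # train on the first part of the series, test on the rest
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X[:split], y[:split])

pred = model.predict(X[split:])
rmse = np.sqrt(np.mean((pred - y[split:]) ** 2))
print(f"one-step-ahead RMSE: {rmse:.3f}")
```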
Furthermore, we show that the performance of the predictor can be improved if a different approach is taken. We still use SVM but we define different inputs/outputs: the prediction is in this case focused on a function of the link load (e.g. the maximum or a percentile value over a given time interval), and the input consists of a summary of statistical properties of the observations taken in a past interval of equal length. In this case, we show that the use of several “machines” in parallel can further improve the performance gain and avoids committing to a particular input choice.
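A hedged sketch of this second setup, again on synthetic data and with purely illustrative feature choices, could look as follows: each “machine” receives a different subset of summary statistics of the past interval, and their predictions of the maximum over the next interval are averaged.

```python
# Illustrative sketch (not the thesis implementation): predict the
# maximum load over the next interval from summary statistics of the
# previous one, combining several SVRs trained on different inputs.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
load = 10 + 3 * np.sin(np.arange(6000) / 50) + rng.normal(0, 0.5, 6000)

w = 60  # interval length in samples
blocks = load[: len(load) // w * w].reshape(-1, w)
stats = np.column_stack([blocks.mean(1), blocks.std(1), blocks.max(1),
                         np.percentile(blocks, 90, axis=1)])
target = blocks.max(1)[1:]   # maximum over the *next* interval
stats = stats[:-1]

split = 70
subsets = [[0, 1], [2], [0, 3]]   # each machine sees different features
machines = [SVR(kernel="rbf", C=10.0).fit(stats[:split, s], target[:split])
            for s in subsets]
pred = np.mean([m.predict(stats[split:, s])
                for m, s in zip(machines, subsets)], axis=0)
print("mean absolute error:", np.mean(np.abs(pred - target[split:])))
```

Averaging the machines sidesteps the choice of a single "best" input set, which is the design motivation stated above.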
Concerning P2P-TV traffic classification, we propose a novel methodology that relies only on the count of packets and bytes exchanged among peers during small time windows. Our claim is that different applications exchange different amounts of information related to their particular operation (e.g. signaling activities or video chunk size), and that this information, conveniently exploited, is enough to identify the particular application. Our classification framework, which makes use of SVM, not only accurately identifies P2P-TV traffic (95% of correctly classified traffic in the worst case), but also handles traffic that is not generated by P2P-TV applications, so that false classification events are negligible. A thorough experimental campaign is performed to validate our results; it uses both active and passive methodologies and is representative of a large set of possible scenarios. Moreover, we analyze the portability of the defined signature, showing that our framework can be extended to very different situations, which constitutes a very important benefit.
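Purely as an illustration of the behavioural idea (this is not the classifier developed in the thesis; the window length, bin edges and application labels are assumptions of the example), a signature can be built from how many packets a host exchanges with each of its peers within a window, and then fed to an SVM:

```python
# Toy behavioural classification sketch: the feature vector describes
# the distribution of packets-per-peer observed in one time window.
from collections import Counter
import numpy as np
from sklearn.svm import SVC

def signature(packets):
    """packets: iterable of (peer_ip, n_bytes) seen by one host during
    one small time window. Returns the fraction of peers with which
    1, 2-3, 4-7, 8-15 or 16+ packets were exchanged."""
    pkts_per_peer = Counter(ip for ip, _ in packets)
    bins = np.zeros(5)
    for n in pkts_per_peer.values():
        bins[min(int(np.log2(n)), 4)] += 1
    return bins / max(bins.sum(), 1)

# Toy training set: two synthetic "applications" with different peer
# contact patterns (many short contacts vs. few long-lived ones).
rng = np.random.default_rng(2)
def window(n_peers, pkts_hi):
    return [(f"10.0.0.{p}", 0) for p in range(n_peers)
            for _ in range(rng.integers(1, pkts_hi))]

X = [signature(window(50, 3)) for _ in range(30)] + \
    [signature(window(8, 40)) for _ in range(30)]
y = [0] * 30 + [1] * 30   # placeholder application labels

clf = SVC(kernel="rbf", C=10.0).fit(X, y)
print(clf.predict([signature(window(50, 3)), signature(window(8, 40))]))
```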
In the second part of the thesis, we deal with two different problems related to multiple access mechanisms in ad-hoc networks. The first one concerns the modeling of the well-known CSMA/CA mechanism. We found that in many of the proposed models it is not clear what the underlying assumptions are and how they relate to the real protocol. Without aiming at providing a tutorial on the topic, we present a clarifying picture of the several models that, to the best of our knowledge, have been proposed in the literature for the analysis of the performance of CSMA/CA. In particular, we discuss when the assumption on the CCA mode can be overridden and in which cases a model based on the definition of exclusion domains is appropriate.
In addition, we propose a new model for CSMA/CA that takes into account shadowing/fading effects, and we compare it with previous proposals. This comparison allows us to show that this aspect plays a major role in the performance of the mechanism. Furthermore, we show that models originating in the stochastic geometry context provide analytical formulae for some important performance indicators, such as the access probability or the spatial reuse. These models are known to be conservative, but until now a quantitative analysis of the magnitude of this difference has been lacking. We analyze this gap assuming a slotted division of time, for both deterministic and random attenuation functions. We show that, even if significant differences can appear in terms of spatial reuse (and thus in the rate), these differences are reduced when other performance metrics that combine spatial reuse and rate are considered. Moreover, it should be said that the results obtained by Matérn-like models are qualitatively similar to real ones.
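To give a flavour of the stochastic geometry models involved, the following toy simulation (our own illustration, not the thesis model) generates a Matérn type II hard-core process, a common abstraction of the set of nodes that CSMA allows to transmit simultaneously, and compares the empirical retention (access) probability with the classical closed form p = (1 - exp(-λπr²))/(λπr²):

```python
# Toy Matérn type II simulation: contenders form a Poisson point
# process; a node transmits only if it has the smallest random mark
# among all contenders within the exclusion distance r (a crude
# abstraction of carrier sensing). Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(3)
lam, r, side = 1.0, 1.0, 30.0   # intensity, exclusion radius, window

n = rng.poisson(lam * side * side)
pts = rng.uniform(0, side, size=(n, 2))
marks = rng.uniform(size=n)

d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
neighbor = (d2 < r * r) & ~np.eye(n, dtype=bool)
# a node is retained iff no neighbor within r has a smaller mark
retained = np.array([not np.any(marks[neighbor[i]] < marks[i])
                     for i in range(n)])

v = lam * np.pi * r * r
print("empirical access probability:", retained.mean())
print("Matérn II closed form       :", (1 - np.exp(-v)) / v)
```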
An underlying hypothesis of the analyzed models is that the CCA is performed in Carrier Detection (CD) mode. However, other modes can be considered, for instance the Energy Detection (ED) mode. Whereas in the first case the channel is reported as busy if at least one signal is detected, in the second one the condition is on the total received power (interference). In this last case, all previously introduced models are no longer valid. We then propose an analytical framework, based on stochastic geometry tools, that allows us to provide a thorough comparison between these two modes. In particular, analytical formulae are given for the access probability under both considered modes. Moreover, a mixture of these two modes is also analyzed.
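To make the difference between the two CCA modes concrete, here is a small Monte Carlo sketch of our own (not the analytical framework of the thesis; the path-loss model, density and thresholds are illustrative assumptions):

```python
# CD vs. ED clear channel assessment, estimated by simulation: CD
# declares the channel busy if any single interferer is detectable,
# ED if the *total* received power exceeds a threshold.
import numpy as np

rng = np.random.default_rng(4)
lam, side, alpha = 0.001, 100.0, 3.5  # density, area side, path loss
thr = 1e-4                             # common detection threshold
trials = 2000

idle_cd = idle_ed = 0
for _ in range(trials):
    n = rng.poisson(lam * side * side)
    # distances from a sensing node at the window center
    d = np.hypot(*rng.uniform(-side / 2, side / 2, (2, n)))
    p_rx = np.maximum(d, 1.0) ** (-alpha)  # bounded power-law loss
    idle_cd += not np.any(p_rx > thr)      # no single signal detected
    idle_ed += p_rx.sum() <= thr           # total interference low

print("access probability, CD mode:", idle_cd / trials)
print("access probability, ED mode:", idle_ed / trials)
```

Since the total power is never smaller than the strongest single signal, with a common threshold ED reports the channel busy at least as often as CD, which is exactly the kind of gap the proposed analytical framework quantifies.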