Component internal state
As we established before, one of the important features of software components which we wanted to include in the Aeolus model was their life-cycle. Most software components are not static from the system administration point of view: they have a dynamically changing internal state that affects their properties and behaviour.

Motivation
Intuitively, in almost any service which we deploy in a distributed system we can distinguish at least two obvious internal states, which we could call not running and running. Of course this is a very simplified and high-level way of understanding the concept of an “internal state” of a piece of software. If we wanted to model the details of the functioning of any program more deeply, the representation of its internal state might get very complex. In the extreme case we can imagine that it could encompass everything down to the level of a full dump of the contents of its memory at any given moment of time. Getting into so much detail is definitely not our aim here. In the Aeolus model we rather want to abstract away all the low-level details and stick to a high-level view of how the services operate.
For us the important aspects of a software component’s internal state are those which are relevant from the distributed system administration point of view. Usually this means that we tend to care only about those internal state changes which really affect a given component’s properties in a way that directly concerns other components and thus may introduce problems in the system. For example, if a certain component needs a working database connection in order to function properly and the database component is being shut down for maintenance reasons (which we can simply model as being switched to the not-running state), the aforementioned component will not work correctly any more and we will end up with a broken system. As we can see in this case, the order in which we switch on and off different components can be important when we are reconfiguring our distributed system. Hence we would like to keep track of this kind of internal state information in the model.

In practice the states that we include in the model closely resemble the typical life-cycle of a real service: usually the set of possible states will contain at least elements like uninstalled, installed and running (or initial, configured, running). In some cases we will also add more specific states in order to model more complex inter-component interactions requiring many stages, or to allow more advanced types of service orchestration. Adding additional states is also natural if a certain component has many modes of functioning (e.g. running-as-master and running-as-slave).

There is one more aspect of the problem that we should keep in mind: the evolution of the internal state of a given real service is generally not random, it follows some strict rules. For example, many services cannot go directly from being uninstalled to being running, they have to pass through the intermediary installed state (where they are ready to work, but have not been launched yet). Including this kind of restriction in the model is necessary if we really want to reproduce the real services’ behaviour.
State machine
After describing the background and explaining the reasons for the key design decisions concerning the component internal state in the Aeolus model, let us move on to the actual implementation of this feature. In the Aeolus model each component is fitted with a finite state machine. Every state of that machine represents a certain internal state of the corresponding real service operating in the distributed system. Although in most cases these are simply the “big” steps of the service’s life-cycle (e.g. uninstalled, installed, running, as depicted in figure 4.3), sometimes we may also want to include more subtle stages, like phases of a complex reconfiguration process (e.g. authenticated, synchronized) or multiple modes of functioning, as depicted in figures 4.4a and 4.4b.

At every given moment each component is in a single precise internal state, known as its current state. The current state can evolve, following the transitions available in the state machine. Each transition on the model level corresponds to some actions happening in the real system and leading to a change of the internal state of the matching service. Mostly, what we call “actions” here refers to specific local administration tasks concerning the service in question. For example, passing from the uninstalled to the installed state may be equivalent to installing an appropriate software package on a certain machine, and then passing from the installed to the running state may simply correspond to launching the service (e.g. starting its daemon in the background). This kind of relation between the model-level state changes and the system-level actions is illustrated in figure 4.5. We do not prohibit some degree of communication with other services happening during these actions. For instance, when a web application is switching its state from installed to running it may contact the database and query some information, then the database will respond, etc. The action is meant to be local from the administrative point of view: on the model level it should have effect only in the scope of the component which changes its state.
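To make this state-machine view more concrete, here is a minimal Python sketch of a component fitted with a finite state machine; the class, the life-cycle states and the transition relation below are illustrative assumptions for the sake of the example, not part of the Aeolus formalism itself.

# A minimal sketch of a component fitted with a finite state machine.
# The class and the uninstalled/installed/running life-cycle below are
# illustrative, not a normative part of the Aeolus model.

class Component:
    def __init__(self, name, states, transitions, initial_state):
        self.name = name
        self.states = set(states)            # internal states of the component
        self.transitions = set(transitions)  # allowed (source, target) pairs
        self.current = initial_state         # the current state

    def change_state(self, target):
        """Follow one transition of the state machine.

        On the real system this corresponds to a local administrative
        action, e.g. installing a package or starting a daemon.
        """
        if (self.current, target) not in self.transitions:
            raise ValueError(
                f"{self.name}: no transition {self.current} -> {target}")
        self.current = target


# A typical service life-cycle: the component cannot jump straight from
# 'uninstalled' to 'running', it must pass through 'installed'.
webapp = Component(
    name="webapp",
    states=["uninstalled", "installed", "running"],
    transitions=[("uninstalled", "installed"),
                 ("installed", "running"),
                 ("running", "installed"),
                 ("installed", "uninstalled")],
    initial_state="uninstalled",
)

webapp.change_state("installed")   # e.g. install the software package
webapp.change_state("running")     # e.g. start the daemon in the background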
Meaning of port arities
Let us see now what our redundancy and capacity constraints on the model level are supposed to mean in the context of the corresponding real distributed system.

Typical use
The require arities are obviously useful to model all the one-requires-many relationships. We should note, however, that these are not supposed to be functional requirements (i.e. one requires three to work correctly), but rather non-functional ones (i.e. one could work fine with one, but requires three because of high workload or for fail-safety reasons). So the require arities of our ports usually correspond not to real hard prerequisites, but rather to our arbitrary deployment policy. Typical uses of this mechanism encompass fail-safe solutions based on redundancy (e.g. although the component X requires to be bound to two different components, in reality it is using only one of them at any given time, but it will immediately switch to the other one if the currently used one fails) and all variants of master-slave or frontend-backend patterns (e.g. the master component divides the work between multiple slaves, which permits it to handle bigger workloads efficiently). Conversely, the provide arities can be used to determine roughly what amount of workload a certain component should be able to support, given that we can estimate the amount of workload which each of the components using it will demand.
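To illustrate how such arities constrain a concrete configuration, here is a short Python sketch; the component names, the port name "backend" and the numeric arities are hypothetical, chosen only for illustration. It checks that a set of bindings satisfies both the require arities (redundancy) and the provide arities (capacity).

# A sketch of how require/provide arities constrain bindings on one port.
# Names and numbers are illustrative; this is not the Aeolus syntax itself.

# Each binding connects a requiring component to a providing component
# on a given port, here the hypothetical port "backend".
bindings = [
    ("load_balancer", "backend", "worker_1"),
    ("load_balancer", "backend", "worker_2"),
]

# Require arity: the load balancer must be bound to at least 2 distinct
# workers, a non-functional (redundancy/workload) requirement.
require_arity = {("load_balancer", "backend"): 2}

# Provide arity: each worker accepts at most 3 clients on this port,
# a rough estimate of the workload it can support.
provide_arity = {("worker_1", "backend"): 3, ("worker_2", "backend"): 3}

def check(bindings, require_arity, provide_arity):
    for (requirer, port), n in require_arity.items():
        bound = {p for (r, q, p) in bindings if (r, q) == (requirer, port)}
        assert len(bound) >= n, f"{requirer} needs {n} providers on {port}"
    for (provider, port), n in provide_arity.items():
        clients = {r for (r, q, p) in bindings if (p, q) == (provider, port)}
        assert len(clients) <= n, f"{provider} exceeds capacity {n} on {port}"

check(bindings, require_arity, provide_arity)  # passes for this configuration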
Deployment policy aspect
The require and provide arities given to ports should not be automatically considered as inherent properties of their components, as they often belong more to the domain of the deployment policy of the whole system. Their exact values may depend strongly on the circumstances in which the components operate and usually only make sense in the context of a given distributed system. Different systems based on similar components may be designed to satisfy different non-functional requirements. For example, the components used to model a toy small-scale version of a certain system can be almost exactly the same as the ones used to model a serious large-scale one (provided that the considered distributed architecture scales up easily), but their require and provide arities will generally change in order to reflect the bigger expected workload and stricter fail-safety requirements. In practice, a tool using the Aeolus model as an internal representation may well decide to maintain different profiles for the arities associated to some components, and create an instance of the model using a given profile only when a particular analysis must be performed.

Varying arities
Another interesting feature of the require and provide arities as we defined them is the fact that their values for a given port can vary from state to state. This gives quite a lot of flexibility and makes it possible to model software components which can be configured in multiple different ways and whose non-functional (and possibly also functional) requirements depend on their mode of functioning.
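The following small Python sketch illustrates both ideas under invented names: arities that depend on the current state of a component, and deployment-policy profiles that assign different arities to the same component type.

# A sketch of state-dependent arities and of deployment-policy "profiles".
# The component, port and profile names are purely illustrative.

# Arities attached to (state, port) pairs: the same component requires
# a different number of peers depending on its mode of functioning.
require_arity = {
    ("running-as-master", "slave"): 2,   # a master wants two slaves bound
    ("running-as-slave",  "slave"): 0,   # a slave does not use that port
}

# Deployment profiles: the same component types, different arity policies.
profiles = {
    "toy":        {("database", "db"): 1},   # small-scale test deployment
    "production": {("database", "db"): 10},  # higher workload, more capacity
}

def provide_arity(profile, component, port):
    """Provide arity of a port under a given deployment profile."""
    return profiles[profile][(component, port)]

print(require_arity[("running-as-master", "slave")])  # 2
print(provide_arity("toy", "database", "db"))          # 1
print(provide_arity("production", "database", "db"))   # 10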
Initial deployment
This is the first reconfiguration which every distributed system has to undergo, passing from an empty environment to a fully configured, working system providing certain functionalities. Usually we start with nothing (a number of clean machines or simply a public or private cloud) and we have some kind of description of the final system that should be implemented. Then we install and set up components in the right order and we orchestrate them to work together in order to attain that final desired state. The main regularity particular to the initial deployment is that the general direction of the reconfiguration is intuitively simple and constant: up. We are building the system: we install components, configure them, start them, establish connections between them, etc. Very rarely is any destructive action, or even a change of existing configuration, required. We simply add all the elements of the puzzle in the right order until we attain the goal. The other specificity of this phase of the system’s life is that we usually do not expect the whole to be operative and functional until the very end of the deployment process. We can safely assume that during the entire time of the initial reconfiguration there are no clients, users or external applications actively utilizing the system and depending on it to work. We can imagine that nobody from the outside will access it until we decide that the set-up is over and we give the green light (e.g. open the ports on the external firewall).
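As a rough illustration, an initial deployment can be thought of as an ordered sequence of purely constructive low-level actions. The Python sketch below prints such a plan; the action vocabulary and the component names are only assumptions made for the sake of the example, not the formal Aeolus reconfiguration language.

# A sketch of an initial-deployment plan: an ordered sequence of
# constructive actions only.  Action names and components are illustrative.

plan = [
    ("create",       "database"),                        # instantiate the component
    ("change_state", "database", "uninstalled", "installed"),
    ("change_state", "database", "installed", "running"),
    ("create",       "webapp"),
    ("change_state", "webapp", "uninstalled", "installed"),
    ("bind",         "webapp", "db_port", "database"),   # establish the connection
    ("change_state", "webapp", "installed", "running"),  # the webapp can now start
]

for action in plan:
    print(*action)   # a real tool would execute the corresponding admin task here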
Update
After the system has been deployed, it has to be maintained. This entails performing changes on an already working system. In some situations it means that we are required to keep it actively running and providing services during the whole time of the reconfiguration. In others we are allowed some downtime, when the system is not expected to be fully operational. There are a few typical reasons for performing changes on a system that has already been set up. One of the most common is software updates. Software, and especially highly popular open-source software, tends to evolve quickly: bugs are fixed, performance is improved, new features are introduced. In order to keep our system safe and up to date, we have to patch its elements regularly. Even if we are quite conservative in this matter and prefer stability over new features, we have to at least apply the most important security updates, or we risk making the system vulnerable to attacks. Another reason for reconfiguring a running system is to scale it up or down in order to adapt its capacity to the current or expected workload.
This is particularly common in the case of systems deployed on clouds, as cloud-based solutions are often chosen specifically to be easily and frequently scalable. Both these cases have one thing in common: although preparing and performing them correctly may require careful planning and complete knowledge of all the component relations in the system (especially if we need to keep it running without interruption), the introduced changes are typically quite restricted and they preserve the general structure of the whole deployment. Updates are usually limited to a single component at a time (unless they introduce some backward incompatibilities between versions) and scaling up or down entails adding or removing instances of components that already exist in the system. It is also possible that a major modification which significantly restructures the system sometimes needs to be performed. In these cases, however, keeping the system running all the time is rarely required (or possible), so such substantial overhauls are often performed in three phases: building a new system (separately from the existing one), migrating all the data to it, and then tearing the old system down. In fact we should note that in some environments this kind of procedure is also a common practice for much smaller updates, especially if we are not really sure whether the updated system will work correctly and we have a reliable method of swiftly switching from the old system to the new one.