uLFS: Your Manageable and Reusable Linux From Scratch

Author: Roc Zhou(chowroc dot z at gmail dot com)
Date: 26 Jul 2010


Abstract

The project page is: [https://sourceforge.net/projects/crablfs/]

and the svn repository is: [https://crablfs.svn.sourceforge.net/svnroot/crablfs]

The original purpose of this project was a package manager for LFS distributions (the User Based Package Management System). It uses Linux's basic user/group mechanism to distinguish different packages (I may add an LD_PRELOAD mechanism later), and it can automatically rebuild a customized LFS again and again by compiling the source code.

Furthermore, I found that this is really a problem of reusability and manageability, especially when hundreds to thousands of machines have to be managed. There is a lot of duplication in the process of system administration, so I changed the direction of the project.

So there are now 3 subprojects: caxes, cutils and ulfs. You can find them in the subdirectories of the SVN repository.

caxes is the base of the other subprojects. It includes several libraries and tools that improve the reusability of system administration and network management, and thus raise the level of automation.

One aspect of reusability is reuse of configuration syntax, so caxes contains several modules that parse trees, tables and sequences from plain text files, XML files, or LDAP/SQL databases.

To do this more gracefully, several new data structures are being defined in Python, including Tree and Table. So far the most basic structure of Tree has been created; you can find its documentation in the code of tree.py and its unit test test_tree.py, or look here:

A Python New Data Structure

Another aspect is a configuration sharing mechanism among dozens to hundreds of hosts, which makes rebuilding a system more convenient and more automatic. A tool named ctemplates, which makes use of configuration templates, will be created to achieve this.

cutils is a set of tools that depend on caxes and handle many system administration tasks and daily Linux use, including file system backup, protocol-independent file transport, Gmail backup, etc.

Currently I'm working on fs_backup and mirrord/fs_mirror in cutils. fs_backup is a tool that backs up the file system with identities, to improve the manageability of backups. mirrord/fs_mirror is an agent/client application that synchronizes the file systems of more than 2 hosts in near real time with the aid of inotify in the Linux kernel, so you can use it to build a low-cost hot backup (near-line) system. For example, in a group of 7 hosts, 5 hosts are online systems running mirrord, 1 host is the hot backup system running fs_mirror, and 1 host is the substitute (rescue host) that takes over when one of the 5 online hosts crashes.

With the aid of the fs_info module, also included in cutils, you can back up or mirror only the specified parts of the file system that contain key data, and exclude some subdirectories of those parts.

For more details, please just jump to the Chapter:

"Backup and Recovery".

And its Chinese translation.

The ulfs subproject is the package management system mentioned above. Since it automatically records the compilation and installation commands, those profiles can be used as configuration by caxes and shared by ctemplates, making the build process reusable not only by yourself but by anyone who needs it.

This document can be considered a preliminary design, and may eventually be extended into a user manual. Its source, written in txt2tags, is in the 'doc' subdirectory of the svn repository. The detailed design documentation is in the program source code (Python __doc__) and the unit tests.

This document is mainly about the reusability of computer systems, especially for large numbers of computers. I will only talk about the aspects that affect reusability, and only introduce a basic concept of a basic architecture for system administration; most other aspects of the architecture already have existing solutions and are out of the scope of this document.

So far there is only an English version of this document; I will write a Chinese version when I have time.

There is also a document with some of my system administration notes and experiences, in Chinese, at:

Some of my system administration notes and experience documents (in Chinese)

If I have time, I will translate it into English.

Thanks.

Enterprise Software Environment and its Reusability

Architecture Abstract

Before we talk about the reusability of the system environment, we must look at the system architecture. The system here is not a single computer; it is a system with dozens to hundreds of hosts in its environment.

So what is architecture, and why do we need it?

Today's services are becoming more and more complicated: they usually involve many machines, sometimes including clusters and distributed systems; more and more types of service are offered; and more and more tools and applications are invented every day... So how do we manage all of these things?

A compact system is not just an accumulation of hardware and hosts, so we need a center to control the whole system, just like a head. It is hardly possible to complete all the management tasks on a single machine, and even if you can, it is not a good idea, because eventually it will become impossible.

So we must split the tasks so that different hosts carry them out. These hosts eventually form a relatively small network for administration, which is the infrastructure of the whole system, just like a head with left and right brains, eyes, ears...

The systems for these different tasks can be called subsystems. We can classify them like this:

  ** this graph should be shown with monospaced font **
  
                     <--- VPN implementation --->
  
                       +----------------------+
                       |   security system    |
                       +----------------------+
  
                       +----------------------+     +-----------------+
                       |  authentication      |     | document center |
                       |   && authorization   |     +-----------------+
    +-----------+      +----------------------+
    | service 1 |   
    +-----------+      +----------------------+ 
                       | monitor &&           |
    +-----------+      |     logs analysis    | 
    | service 2 |      +----------------------+
    +-----------+        
         ...           ########################
    +-----------+      # configuration &&     #\
    | cluster 1 |      #     control center   # \------------\
    +-----------+      ########################              |
                                  |                          |
                                  |                          |
                                  V                          V
                       ########################     ####################
                       # storage & backup sys #-----# revision storage #
                       ########################  |  ####################
                                  |              |
                                  |              |  %%%%%%%%%%%%%%%%%%%%
                                  |              \  % packages         %     
                                  |               \-%    collection    %
                                  V                 %%%%%%%%%%%%%%%%%%%%
                       +----------------------+
                       | backup of backups    |
                       |     on Mars          |
                       +----------------------+

In this graph, the subsystems in the center and on the right are the infrastructure, and those enclosed with '#' or '%' are the important points for improving reusability, which I will discuss later. The components on the left are part of the superstructure, which is outside the range of my discussion.

This graph does not show details. In fact, every infrastructure subsystem has logical connections with the other subsystems and the superstructure. For example, the configuration center needs backup, revision control, and package management (the purpose of the packages collection), and the document center and the monitor need those too! Details are in the following chapters.

Now we can talk about each subsystem in turn:

  1. Security System

    Security as a subsystem plays 2 important roles: the filtering firewall (such as iptables) and the IDS (Intrusion Detection System, such as snort), so I think it should be part of the routers or gateways. An intranet environment may have more than one gateway, so how to organize them is something we should pay attention to.

    There are also other security issues, such as every host's permissions and ownership, tripwire for integrity, every service's security parameters, etc., so how to organize those may also be part of the security system. Generally speaking, authentication, authorization, logging and monitoring also have security functions, but they have other purposes as well, so I talk about them separately.

    It is important that the security system does not become the bottleneck of the network bandwidth.

  2. Authentication and Authorization

    Authentication proves that someone is who he/she claims to be, and authorization decides what he/she can do in the system; with the aid of monitoring, we can also do resource limiting with authorization. Together these make up account management.

    Nearly every service has authentication, depending on one of dozens of authentication mechanisms. So to reduce the cost, centralized authentication is necessary; otherwise you will have to manage several authentication methods with thousands of inconsistent accounts.

    So far, the best idea I have seen for authentication is Kerberos with OpenLDAP and some other modules, for example:

  3. Monitor and Log Analysis

    The purpose of monitoring is to watch and record the activities of the machines, mostly performance status such as CPU/memory/disk, and to find the bottlenecks; sometimes the status of specific processes is needed too.

    The most important kind of application for monitoring is an SNMP tool, such as net-snmp. Based on SNMP, we can use rrdtool and cacti to draw status graphs for all the hosts and services and display them on the pages of one host: the monitor center.

    There are two main purposes of log analysis: reporting sensitive information, and statistics. For the former, there is often a filter based on regular expressions that extracts the most important information, such as security events, faults or failures; logwatch and swatch are examples. So this can also be considered part of monitoring.

    For the latter, there are currently no unified tools for all services, because the formats always differ.

  4. Documents

    For such a large environment, it's not a good idea to put all the docs in 'man' or 'html' format (using a CLI browser such as elinks/w3m/lynx) on every machine, because they are not convenient to search, especially for information specific to the company or organization. The docs also have to be changed and upgraded periodically, with many people working on them, which pushes us to centralize the docs, put them under revision control, and add search utilities.

    The documents should be written in a markup language, such as XML/SGML DocBook, TeX/LaTeX, or txt2tags as used for this document. These formats allow the documents to be converted easily to other formats such as html or pdf, or to each other.

  5. Storage and Backup

    There may be hundreds of solutions for storage and backup, most of them commercial and expensive. Maybe we need a simple one. And in my opinion, some revision control is part of backup.

  6. Configuration and Control

    I will talk about this later, after we talk about the duplications in the architecture. This subsystem will play an important role in the reusable environment.

Duplications in System Administration

There is a very important rule in development: DRY (Don't Repeat Yourself). It is relatively easy for a programmer to achieve, because you only need to think about one application. But a system administrator has to take into consideration a great deal of data duplication between various applications and hosts.

To describe data duplication, I must explain my conception of the data in a computer. I think there are 3 types of data:

  1. software environment
  2. configuration
  3. runtime data

The data of the software environment is mainly the packages of applications (and their documents); this is the most static data on the computer.

The configuration data is mostly the settings in configuration files; sometimes these settings are in databases. It is more dynamic than the software environment but more static than runtime data. A very important characteristic of configuration is readability: it is a type of User Interface (UI).

So the code (software environment) decides what the program can do, while the configuration decides what the program should do. The runtime data is the object the program operates on.

The runtime data is mostly in relational databases; it is totally dynamic, and direct readability is not necessary, since there are tools to access the data.

Duplications in Software Environment

There are many Linux and BSD distributions, and each has its own binary package manager. To avoid this type of duplication, we had better choose one of them for our architecture environment to reduce the cost.

But binary packages have vital limitations: they are platform dependent, and their features are fixed and cannot be changed. For example, the mutt RPM of Fedora has no libesmtp support; if I need it, I still have to compile the libesmtp and mutt sources to add this feature. If 100 hosts need this modification, what then? Especially when these 100 hosts span several platforms, such as x86, amd64, ppc, etc., or several OSes or distributions: Debian/Ubuntu, RedHat/Fedora, Gentoo, LFS, FreeBSD... After all, in this age of individuality, it's not wise to restrict people's ways and habits of using computers.

So as you can see, this leads to duplications.

Red Hat/Fedora of course has SRPM, which is also a kind of source compilation control. So why not build the whole system from source directly? In fact this is the spirit of UNIX and Open Source.

There are several source package managers, such as the simple paco, the more complicated RedHat/Fedora SRPM, or the ports/portage systems of FreeBSD/Gentoo.

The ports system may be a good idea, but ultimately user customizations are necessary, so how can these customizations be made reusable?

RedHat/Fedora SRPM and Debian/Ubuntu dpkg-source are also good choices, but I would rather use the package manager I have written, especially for the needs of LFS (Linux From Scratch). I will talk about all of them in later chapters, along with the installation templates based on them for reusability.

Duplications in Configuration

There are two types of configuration duplication. One is duplication between different applications; sometimes the duplication even appears in different settings of a single application.

For example, many applications need some IP address or hostname information, and you have to type these IP addresses into different configuration files in different formats. When the IP changes, you have to change all the different files, and there is no guarantee that you will never forget one.

You put the /dev/input/mice setting into /etc/X11/xorg.conf for the Xorg mouse, and you still have to set the same value for gpm. Changing one will not automatically change the other.

...

I think the Apache's httpd.conf is also a good example:

You have defined DocumentRoot /var/www/html, but you still have to write <Directory /var/www/html> rather than <Directory $DocumentRoot>.

If you set Apache's log to logs/access_log, you also have to edit /etc/logrotate.d/httpd manually to add the path name of the log file. Again, changing one will not change the other automatically.

Write scripts to make the changes automatically? Maybe. But the formats are so different that you have to write a lot of scripts, and then keeping all these scripts consistent generates duplication of its own, which becomes a nightmare.
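
As a rough illustration of the alternative to hand-editing both files, the following Python 3 sketch renders an httpd.conf fragment and a logrotate fragment from one set of values. It is not ctemplates; the key names, the absolute log path and the fragment contents are assumptions made only for this example.

```python
from string import Template

# Hypothetical single source of truth; key names and paths are illustrative.
settings = {
    "document_root": "/var/www/html",
    "access_log": "/var/log/httpd/access_log",
}

HTTPD_FRAGMENT = Template(
    "DocumentRoot $document_root\n"
    "<Directory $document_root>\n"
    "</Directory>\n"
    "CustomLog $access_log combined\n"
)

LOGROTATE_FRAGMENT = Template(
    "$access_log {\n"
    "    missingok\n"
    "    notifempty\n"
    "}\n"
)


def render_all(values):
    """Render every fragment from the same values so they cannot drift apart."""
    return {
        "httpd.conf": HTTPD_FRAGMENT.substitute(values),
        "logrotate.d/httpd": LOGROTATE_FRAGMENT.substitute(values),
    }


if __name__ == "__main__":
    for name, text in render_all(settings).items():
        print("# " + name)
        print(text)
```

Changing document_root or access_log once changes every fragment that mentions it, which is the property the hand-maintained files lack.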

The other type of configuration duplication is between hosts. You always have to log in to every host to modify the configurations.

Maybe we can make use of configuration sharing, but so far I have not seen a solution that satisfies me, especially regarding the single point of failure of configuration sharing. Another problem of configuration sharing is that most of the configuration is the same but a small part is not, which may prevent whole files from being shared because there is no import or include mechanism.
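
The "mostly the same, slightly different" case can be sketched as a shared set of settings overlaid with per-host differences. This Python 3 fragment is only an illustration of that idea under assumed names (the keys, host names and values are invented); it says nothing about how the sharing mechanism of this project will actually work.

```python
def merge_settings(shared, host_specific):
    """Return the shared settings overlaid with one host's differences."""
    merged = dict(shared)         # copy, so the shared data is never mutated
    merged.update(host_specific)  # host values win on conflict
    return merged


# Hypothetical flat keys and host names, for illustration only.
SHARED = {
    "ntp.server": "ntp.example.com",
    "syslog.remote": "log.example.com",
    "ssh.port": "22",
}
PER_HOST = {
    "web1": {},
    "db1": {"ssh.port": "2222"},
}


def settings_for(host):
    """Look up a host's overrides (none if unknown) and apply them."""
    return merge_settings(SHARED, PER_HOST.get(host, {}))


if __name__ == "__main__":
    print(settings_for("db1")["ssh.port"])
```

Only the differing keys are stored per host, so the common part exists in exactly one place.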

I have met this situation: two departments maintain two systems whose settings in databases are similar, but the formats differ and some of the contents differ. Political factors aside, the inconsistency of the systems during updates causes conflicts between the departments.

Text is good for configuration, but sometimes editing a text file is not so convenient, because your eyes and hands have to spend time searching through the context.

Yes, many settings are simple to modify, but if hundreds to thousands of things have to be changed under these conditions, it's not simple.

More examples of services configuration duplications ...

More examples of hosts configuration duplications ...

The previous duplications can be considered space duplications; there is also a time duplication: you edit a configuration file, and several days or weeks later you may forget what you modified. If there are several administrators, how do you know who made which modifications?

These can be solved by revision control, but this goal is not easy to achieve before the Configuration Sharing Mechanism for the previous duplication problems has been built.

Other Duplications

Other duplications in the systems mostly come from a NOT WELL FORMED architecture or NOT GOOD utilities.

For example, the lack of a central authentication and authorization mechanism leads to a mass of auth mechanisms for different services, and a member may have to remember several accounts (more for system administrators). Of course you can create the same account for one person in different services, but it's a nightmare to modify the passwords or account information across all the services and hosts.

If there is an authentication and authorization center, it is very easy to disable an account: just LOCK it in the authentication center (as with 'usermod -L' on a single system), and then the authorization problems elsewhere can be ignored.

Some people take a copy or replication as a backup; I think that's wrong. If the backup tools do not consider the restore/recovery process and have no good management organization, you will have to write many unmanageable scripts for backup and do many manual operations during restore.

We talked about documents above; in that way we can remove most duplication in document management: just choose the right markup language to write the documents with the aid of revision control, and convert between formats automatically. There are many tools to do this.

A system administrator may have to pay too much attention to many details and copies in a system environment with low reusability. There is a proverb: the devil is in the details. I also think the devil is in the copies, because copying is mostly the responsibility of machines, not humans. If the copying has to be done manually, it's a nightmare.

Aims for Reusability Design

To make such a system architecture a compact system, building all the subsystems mentioned before becomes a tool chain problem. The key is the circular dependencies among the different subsystems, which tie them closely together, just like building an LFS system, where all the packages are compiled from source. The build process is complicated and it is inefficient to do it manually every time; that is the purpose of reusability.

During maintenance there are also the many duplications we have discussed; to solve those problems, reusability is the key.

So to make this big system run effectively, reusability becomes a very important criterion and foundation.

Reusability will also improve the orthogonality of the whole system, by designing (1) a new configuration syntax (mini language) and the corresponding APIs, and (2) a new configuration sharing mechanism with the aid of revision control.

After an architecture has been built with the aid of the reusable mechanism, everything relevant will have been recorded, including package information and configuration settings. All of this becomes a template.

Other users can build new architectures based on these templates, adjust them to their own needs, and then generate new templates. These new templates can in turn be used by others...

So in the end we could build a new Linux distribution with templates of solutions, or improve the efficiency of the existing distributions, and get more and more choices more easily... Maybe a whole new world?

So what I have in mind is the foundation of the whole infrastructure of a system architecture.

Configuration: High Syntax Level Control, Mini Language

Format(Data) Structure Design

Existing Configuration Formats

Data-driven design is a good approach: usually the code decides what a program can do, and the configuration decides what the program should do. So our problem now becomes: what is the most logical data structure for the settings of a program?

I have said that there are hundreds of configuration formats for perhaps thousands of applications, which leads to duplication among different services and hosts. But we can still find similarities among them. In my opinion, there are 3 types of configuration format: Tree key-value, Table and Sequence List.

Let's have a look at them:

  1. Tree key-value
    1. Simple key-value pair:
      • mutt muttrc:
          my_hdr From: chowroc.z@gmail.com
          set edit_headers=yes
          subscribe 
          set pop_user="chowroc.z@gmail.com"
          set pop_pass=********
          set pop_host="pops://pop.gmail.com:995"
          set smtp_host="smtp.gmail.com"
          set smtp_port="587"
          set smtp_auth_username="chowroc.z@gmail.com"
          set smtp_auth_password="********"
          set smtp_use_tls_if_avial=yes
          ......
        

      • vsftpd vsftpd.conf:
          xferlog_enable=YES
          xferlog_file=/var/log/vsftpd.log
          local_umask=002
          user_config_dir=/etc/vsftpd/users
          chroot_local_user=YES
          userlist_deny=NO
          ssl_enable=YES
          force_local_logins_ssl=YES
          ......
        

    2. Plain text tree:
      • kernel sysctl.conf:
          fs.quota.syncs = 14
          kernel.printk_ratelimit = 5
          kernel.modprobe = /sbin/modprobe
          kernel.panic = 0
          vm.max_map_count = 65536
          vm.swappiness = 60
          net.ipv4.tcp_syncookies = 0
          net.ipv4.icmp_echo_ignore_broadcasts = 1
          net.ipv6.route.max_size = 4096
          dev.scsi.logging_level = 0
          ......
        

      • postfix main.cf:
          myhostname = mail.sample.com
          mydomain = sample.com
          myorigin = $mydomain
          mynetworks = 127.0.0.0/8 192.168.0.0/24
          mydestination = $mydomain
          smtpd_sasl_auth_enable = yes
          smtpd_recipient_restrictions = permit_mynetworks, permit_sasl_authenticated, reject
          smtpd_client_restrictions = 
          smtp_always_send_ehlo = 
          smtp_connect_timeout = 
          smtp_connection_cache_on_demand = 
          smtp_connection_cache_reuse_limit =
          smtp_connection_cache_time_limit =
          smtp_connection_reuse_time_limit =
          lmtp_sasl_auth_enable = yes
          local_destination_concurrency_limit =
          local_destination_recipient_limit =
          ......
        

        sysctl.conf is the most classical plain text tree format; postfix main.cf is the same, it just uses a different delimiter between nodes.

        The simple key-value pair format can also be considered a special case of the plain text tree, with a depth of only 1.

    3. Some other complicated formats:
      • Windows ini: Samba smb.conf, MySQL my.cnf
      • Apache httpd.conf:
      • Bind named.conf

        These configuration file formats can still be considered tree key-value: the depth of 'Windows ini' is at most two, and the depth of Apache httpd.conf is at most three. The delimiters between keys and values are '=' or whitespace (\s or \t), and nodes at different depths are separated by XML-like marks.

        Of course these formats are not standard tree key-value, and some other styles are mixed in, but most parts are tree key-value like, and the nonstandard lines can be converted to a tree or to the other structures which I will discuss next.

    4. XML and LDAP

      Some applications use XML for their configuration files (fontconfig, for example); as we know, XML is of course tree style.

  2. Relational Table

    There are many classical table formats, for example: /etc/passwd /etc/group /etc/hosts /etc/hosts.{allow,deny} /etc/fstab /etc/services

    Postfix also has many tables, usually /etc/postfix/{access,generic,virtual,aliases,...}.

    PAM's /etc/pam.d/*, pam_limits' limits.conf and pam_time's time.conf can also be considered tables.

    Since these are tables, the information can be organized in a relational database, such as an SQL database. Some applications make use of that: for example, postfix can be set to retrieve the access/virtual maps from MySQL, and proftpd/pure-ftpd can be set to retrieve authentication information from MySQL, etc.

  3. Sequence List

    Most of the information in this type of configuration file is file pathnames, for example /etc/ld.so.conf.

    /etc/shells, /etc/filesystems, /etc/vsftpd.ftpusers, /etc/vsftpd.user_list and the files used by pam_listfile are configuration files of this type too.
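
To make the three categories concrete, here is a minimal Python 3 sketch with one small parser per category. The function names are invented for this illustration and are not the caxes modules; the tree parser assumes, as in sysctl.conf, that a key is never both a leaf and an inner node.

```python
def parse_tree(text, sep="."):
    """Parse 'a.b.c = value' lines (sysctl.conf style) into nested dicts."""
    root = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        parts = key.strip().split(sep)
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value.strip()
    return root


def parse_table(text, fields, sep=":"):
    """Parse separator-delimited rows (/etc/passwd style) into dicts."""
    rows = []
    for line in text.splitlines():
        if line.strip() and not line.startswith("#"):
            rows.append(dict(zip(fields, line.split(sep))))
    return rows


def parse_list(text):
    """Parse one item per line (/etc/shells style) into a list."""
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]


if __name__ == "__main__":
    print(parse_tree("vm.swappiness = 60\nnet.ipv4.tcp_syncookies = 0\n"))
    print(parse_list("/bin/sh\n/bin/bash\n"))
```

Note that the table parser has to be told the field names and the separator from outside, while the tree and the list need no extra information.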

Format Choice and Organization

Since configuration file formats are regular, the goal can now be set as: choose the right formats and design the corresponding APIs.

This format design can also be considered a mini-language design. The mini language only describes the configuration data structure, not a procedure; the interpretation will be performed by the API and the application itself.

As we can see, the tree is the most flexible type and has been applied most widely. It is very suitable when the meaning of the settings varies line by line; configuration of this type usually does not contain a very large amount of information.

So my design must center on the tree format, which constitutes the trunk of the configuration system (for ease of understanding we only talk about a single computer system for now). When necessary, the branches and leaves can point to tables and lists.

Among the several styles of tree format, which is the best choice? The very popular XML, for example? Or LDAP for all configuration? That seems simple and straightforward.

But in my opinion, if we use files for configuration, the best format is the 'Plain Text Tree' as in sysctl.conf, also called a Flat Tree or Line Based Tree. It is the clearest and most intuitive style, with the highest readability, while the readability of XML is much lower with <tags> everywhere in the file.

Choosing only LDAP makes the whole system too rigid, and its flexibility and readability are too poor. The configuration would be too centralized, which easily leads to a single point of failure.

The plain text tree is very easy to process with the traditional UNIX tools such as grep, sed, awk, etc. Since the goal is the design of the APIs, the parameters that I pass to the interface when I import the module need to be as simple as possible so that people can remember them; a line-based string is a good idea, rather than XML/LDAP description blocks. For example:

  from caxes.ctree import CTree
  cfile = '/etc/fs_backup.conf'
  cto = CTree(...)
  config = cto.get(cfile, 'backup.datadir')
  ...
  config = {}
  config = cto.get(cfile, 'restore.*')
  ...
  config = cto.get(cfile, 'upm.install.commands.*')
  ...

For better manageability, it's a good idea to put the configurations under revision control. XML lowers manageability because it is not line based, and I don't know how to apply revision control to LDAP. So the plain text tree is the first choice in this field.
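
The benefit of a line-based format for revision control can be seen with the standard difflib module: changing one setting yields exactly one removed and one added line. This is a small Python 3 sketch; the sample keys and values are made up for the illustration.

```python
import difflib

OLD = """backup.datadir = /var/fs_backup
backup.type = full
restore.target = /mnt/restore
""".splitlines(keepends=True)

NEW = """backup.datadir = /var/fs_backup
backup.type = incr
restore.target = /mnt/restore
""".splitlines(keepends=True)


def changed_lines(before, after):
    """Return only the added/removed lines of a unified diff."""
    diff = difflib.unified_diff(before, after, "old.conf", "new.conf")
    return [line.rstrip("\n") for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]


if __name__ == "__main__":
    for line in changed_lines(OLD, NEW):
        print(line)
```

Each changed line carries its full dotted key, so a diff entry is understandable without the surrounding context.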

Generally speaking, the plain text tree is the simplest format for end users; after all, configuration is not a document.

But these line-based strings can also be converted to XML/LDAP blocks by appropriate tools or APIs, and some people may prefer XML to this line string style, so we can also design the corresponding APIs and make the conversion among all these formats easy and automatic.

In this way, the information can be stored in a plain text tree, XML, LDAP or any other storage, but the APIs for end users stay consistent: the user only needs to pass the simple line string to the methods of the APIs.
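
One direction of such a conversion, from dotted keys to XML, might look like the following Python 3 sketch built on the standard xml.etree.ElementTree module. The function name and the root tag are assumptions for this example, and it only handles keys whose parts are valid XML element names.

```python
import xml.etree.ElementTree as ET


def flat_to_xml(flat, root_tag="config"):
    """Build an XML element tree from {'a.b.c': 'value'} style keys."""
    root = ET.Element(root_tag)
    for key in sorted(flat):
        node = root
        for part in key.split("."):
            child = node.find(part)
            if child is None:
                child = ET.SubElement(node, part)
            node = child
        node.text = flat[key]
    return root


if __name__ == "__main__":
    sample = {"backup.datadir": "/var/fs_backup", "backup.type": "full"}
    print(ET.tostring(flat_to_xml(sample), encoding="unicode"))
```

The caller still addresses settings with the same dotted strings; only the storage representation changes.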

A tree cannot describe the whole world, so we let the leaf nodes point to tables and lists. For example, a plain text tree file can contain:

  backup.records = 'ctable:///var/fs_backup/.table'
  backup.ids = 'clist:///var/fs_backup/id_table'

Then the API knows to retrieve information from the given table and list, and makes them easy to use for the program that calls it.
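
A leaf value of this kind can be recognized by its URL scheme. The Python 3 sketch below dispatches on the ctable and clist schemes with urllib.parse.urlsplit; the loader functions are placeholders invented for the example and do not read any file.

```python
from urllib.parse import urlsplit


def load_table(path):
    """Placeholder: a real loader would parse the table file at path."""
    return ("table", path)


def load_list(path):
    """Placeholder: a real loader would parse the list file at path."""
    return ("list", path)


LOADERS = {"ctable": load_table, "clist": load_list}


def resolve_leaf(value):
    """Return a plain value unchanged, or dispatch on ctable:// / clist://."""
    parts = urlsplit(value)
    loader = LOADERS.get(parts.scheme)
    if loader is None:
        return value
    return loader(parts.path)


if __name__ == "__main__":
    print(resolve_leaf("ctable:///var/fs_backup/.table"))
    print(resolve_leaf("full"))
```

New storage types could be added by registering another scheme in LOADERS without touching the tree parser.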

When a table is used as configuration, text style is the first choice, but the table format should differ a little from classical table files such as /etc/passwd or CSV. It should have some 'self-explanatory' characteristics, such as the names of the fields and the separator, so that the caller of the API doesn't need to specify them.

It's better if the API can also access the text table using standard SQL. There are standard documents for this type of API; Python, for example, has the Python Database API Specification v2.0. Thus we don't need to design a new mini-language for the table format.
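
Both ideas can be tried out with the standard sqlite3 module, which follows the Python Database API Specification v2.0. In this Python 3 sketch the header convention ('#' followed by the separator character and then the field names) is purely an assumption for illustration, not the format this project defines.

```python
import sqlite3

# Hypothetical self-describing table: the header names the separator
# (the character right after '#') and the fields.
TABLE_TEXT = """#|bid|btime|type
1|2010-07-01|full
2|2010-07-02|incr
3|2010-07-03|incr
"""


def load_text_table(text, name="records"):
    """Load a self-describing text table into an in-memory SQLite database."""
    lines = [line for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    sep = header[1]
    fields = header[2:].split(sep)
    columns = ", ".join('"%s"' % field for field in fields)
    marks = ", ".join("?" for _ in fields)
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "%s" (%s)' % (name, columns))
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (name, marks),
                     [row.split(sep) for row in rows])
    return conn


if __name__ == "__main__":
    conn = load_text_table(TABLE_TEXT)
    query = 'SELECT bid FROM records WHERE "type" = ? ORDER BY bid'
    print(conn.execute(query, ("incr",)).fetchall())
```

Because the file describes itself, the caller passes only the text; the query language is plain SQL rather than something new.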

Then we can also use real databases as auxiliaries, just as LDAP is for the plain text tree.

The list is similar to the table, and simpler.

There can be a view of format structure organization like this:

  ** this graph should be shown with monospaced font **
  
  CTree(Ftree(Plain Text Tree), XTree(XML), LTree(LDAP))
     |
     |\--CTable(FTable(Text Table), DBTable(real DBs))
     |
     |\--CList(Flist, ...)

You can also have a look at the Big Picture for more details.

APIs and Mini Language Requirements

The previous section discussed some views of the APIs and the mini language; now let's go further.

Tree Configuration(CTree)

A basic format of plain text tree is like this:

  fs_backup.format = "bid, btime, archive, type, list, host";
  fs_backup.type.default = "full";

The corresponding API should look like this:

  from caxes.ctree import CTree
  cfile = open('file:///etc/fs_backup.conf', 'r')
  ctr = CTree()
  config = ctr.get(cfile, optmap=['fs_backup.format'])
  print config
  # {'fs_backup.format' : 'bid, btime, archive, type, list, host'}
  config = ctr.get(cfile, prefix='fs_backup', optmap=['format', 'type.default'])
  print config
  # {'format' : 'bid, btime, archive, type, list, host', 'type.default' : 'full'}
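
The behaviour of prefix and optmap described above can be approximated in a few lines. This Python 3 sketch is a stand-in written for this text, not the CTree implementation: it works on an in-memory string instead of a file, and it ignores comments after a value, multi-line values and variables.

```python
def parse_flat(text):
    """Parse 'key = "value";' lines into a flat dict of dotted keys."""
    flat = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        flat[key.strip()] = value.strip().rstrip(";").strip().strip('"')
    return flat


def get(flat, optmap, prefix=""):
    """Return the requested options, keyed relative to the prefix."""
    result = {}
    for opt in optmap:
        full_key = prefix + "." + opt if prefix else opt
        result[opt] = flat[full_key]
    return result


SAMPLE = '''
fs_backup.format = "bid, btime, archive, type, list, host";
fs_backup.type.default = "full";
'''

if __name__ == "__main__":
    flat = parse_flat(SAMPLE)
    print(get(flat, ["fs_backup.format"]))
    print(get(flat, ["format", "type.default"], prefix="fs_backup"))
```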

But it's better to define a Tree class to make the operations more intuitive:

  class Tree:
  	......
  
  # now CTree.get() will return a Tree instance, so
  config = ctr.get(cfile, optmap=['fs_backup.format'])
  dir(config)
  # [..., 
  #	_Tree__node_value, _Tree__node_items,
  #	__setattr__, __setitem__, __getitem__, __add__, __iadd__
  #	__traverse__, __update__, _one_node_set,
  #	fs_backup, 
  # ...]
  print config.__prefix__
  # ''
  repr(config.fs_backup)
  # <Tree instance at ...>
  dir(config.fs_backup)
  # [..., _Tree__node_value, _Tree__node_items, __str__, format, type, ...]
  print config.fs_backup.type()
  # 'increment'
  print config.fs_backup.type.__dict__
  # { 'default' : <Tree instance at ...>, ...}
  print config.fs_backup.type.default()
  # 'full'
  
  config = ctr.get(cfile, prefix='fs_backup', optmap=['format', 'type'])
  print config.__dict__
  # { 'format' : <Tree instance at ...>, 'type' : <Tree instance at ...>, ...}
  print config.format()
  # 'bid, btime, archive, type, list, host'
  print config.type()
  # 'increment'
  print config.type.default()
  # 'full'
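
To make the intended behaviour concrete, here is a toy node class in which attributes are child nodes and calling a node returns its own value. It is only a sketch under that assumption and is much simpler than the real Tree in tree.py; build_tree is a hypothetical helper.

```python
class Tree(object):
    """Toy node: attributes are child nodes, calling the node returns its value."""

    def __init__(self, value=None):
        self._value = value

    def __call__(self):
        return self._value


def build_tree(options):
    """Hypothetical helper: turn {'a.b.c': value} into nested Tree nodes."""
    root = Tree()
    for dotted_name, value in options.items():
        node = root
        for part in dotted_name.split("."):
            if part not in node.__dict__:
                setattr(node, part, Tree())
            node = getattr(node, part)
        node._value = value
    return root


config = build_tree({
    "fs_backup.format": "bid, btime, archive, type, list, host",
    "fs_backup.type": "increment",
    "fs_backup.type.default": "full",
})
```

A node can thus carry a value and children at the same time, which is what lets config.fs_backup.type() and config.fs_backup.type.default() coexist.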

In fact, this creates an entirely new data structure in Python, Tree, that acts like a built-in type. The details of the Tree design can be found in the section on the new Tree data structure.

Furthermore, the Tree structure can be used directly by Python programs and scripts as a module, so configurations can be written as Python modules rather than in the new mini-language described here, which eliminates most of the text parsing.

The tree configuration must also support the usual '#' comment style:

  fs_backup.default = "full"; # comment
  # more comment

Multi-line values must be supported as well, for example:

  test.long.value = """The value of the options
  also support
  multilines mode""";

There should be variable support:

  fs_backup.type.default = "full"; # comment
  fs_backup.type = ${fs_backup.type.default};
  # fs_backup.type = "incr";
  fs_backup.datadir = "/var/task/fs_backup";
  fs_backup.tagfile = ctable:"$datadir/.tag";
  # Or use absolute path:
  # fs_backup.tagfile = ctable:"${fs_backup.datadir}/.tag";
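
One possible reading of these rules, as a sketch: ${a.b.c} names an option absolutely, while $name refers to a sibling under the same parent. The function expand is hypothetical, it operates on an already parsed {name: raw value} dict, and the ctable: type prefix is left out.

```python
import re

# ${absolute.name} or $sibling
VAR_RE = re.compile(r"\$\{([\w.]+)\}|\$(\w+)")


def expand(name, options, _seen=()):
    """Return the value of option `name` with all variables expanded."""
    if name in _seen:
        raise ValueError("circular reference: %s" % name)
    parent = name.rpartition(".")[0]

    def substitute(match):
        absolute, sibling = match.groups()
        if absolute:
            target = absolute
        else:
            # A bare $name is looked up next to the option being expanded.
            target = parent + "." + sibling if parent else sibling
        return expand(target, options, _seen + (name,))

    return VAR_RE.sub(substitute, options[name])


OPTIONS = {
    "fs_backup.type.default": "full",
    "fs_backup.type": "${fs_backup.type.default}",
    "fs_backup.datadir": "/var/task/fs_backup",
    "fs_backup.tagfile": "$datadir/.tag",
}
```

Here expand("fs_backup.tagfile", OPTIONS) resolves $datadir through fs_backup.datadir, and a reference to an undefined option raises KeyError.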

Sometimes a wildcard is very convenient:

  config = ctr.get(cfile, optmap=['fs_backup.*'])
  # as a dict:
  print config
  # {'fs_backup.format' : '...', 'fs_backup.type.default' : 'full', 'fs_backup.type' : 'incr', 'fs_backup.datadir' : '/var/task/fs_backup'}
  # or as a Tree instance:
  dir(config.fs_backup)
  # [..., __str__, format, type, datadir, tagfile, ...]
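
Shell-style wildcards of this kind are available in the standard fnmatch module. The sketch below applies them to an already parsed {name: value} dict; select is a hypothetical helper and the mirror.port option is invented for the demonstration.

```python
import fnmatch


def select(options, patterns):
    """Hypothetical helper: keep the options whose dotted name matches
    at least one shell-style pattern."""
    return {name: value for name, value in options.items()
            if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)}


OPTIONS = {
    "fs_backup.format": "bid, btime, archive, type, list, host",
    "fs_backup.type.default": "full",
    "fs_backup.type": "incr",
    "mirror.port": "2123",  # invented option that must not be selected
}

selected = select(OPTIONS, ["fs_backup.*"])
```

Note that '*' also matches dots here, so 'fs_backup.*' selects the whole subtree and not only the direct children.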

Furthermore, regular expression support can be added too.

Since we have talked about the format structures and their organization, support for cross-file parsing should also be added:

  fs_backup.others1 = "ftree:///etc/others1.conf";
  fs_backup.others2 = "ldap://hostname/...";
  fs_backup.others3 = "xml:///etc/others2.xml";
  
  fs_backup.others4 = "ftree:///etc/others4.conf/root.branch1.leaf2";

A New Data Structure: Tree, in Python

The documentation of "Tree" is in the code and unit tests, which can be checked out from the svn repository: https://crablfs.svn.sourceforge.net/svnroot/crablfs/caxes/trunk/lib/ The files are tree.py and tree_ut.py.

HTML doc

Table Configuration

SELECT * FROM __CURRENT__ WHERE ... class textdb:

For Tree: SELECT d2.value FROM foo.d1

List Configuration

Orthogonality Design

Strategy Pattern

Configuration Sharing Mechanism

  ** this graph should be shown with monospaced font **
  
                  [meta-conf]
                       |
                       V
                  [subversion]
                       |
                  (sandbox)
                       |
    /--------/--------/ \------\
    |        |        |        |
    V        V        V        V
  {API}    {API}    {API}    {API}---\
    |        |        |              |
    V        V        V              V
  %task1%  %task2%  %task3%      %cluster%
                                     |
                                  .......
  

A Big Picture

data flow view

  ** this graph should be shown with monospaced font **
  
  /text/
  [service]
  {API}
  (directory)
  %program%
  <=manual-operation=
  $DB$
  
                 /<---------%convert%--------->/ftree/ <=modify=          
                 |                               |
     [LDAP]------|<--%convert%-->/xml/<=modify=  |
       |                           |             |
  %replicate%                      |             |
       |                           \--->[SVN]<---/         
       V                                  |
     [LDAP]                               |
       |                                  v
       |                          /---(sandbox)---\
       |                          |               |         
    {LTree}                       V               V         
       |                        /xml/          /ftree/~~~point~~~~~~~~~>$MySQL$
       |                          |               |       |                |
       |                          V               V       \~~~/ftable/     |
       |                       {XTree}         {FTree}           |         |
       |                          |               |              |         |
       \------------------------> \--->{CTree}<---/           {FTable} {DBTable}
                                          |                      |         | 
                                          V                      |         |
                                    %application% <---{CTable}-------------/

Configuration: Low File System Level, Sharing Mechanism

Configuration Sharing Mechanism

We talked about improving reusability through configuration syntax design in the previous chapter. That was mainly about reuse among different services, especially those on a single computer system. With the aid of a network file system and the "variable support", "cross file parsing" and "include/import" features, however, configuration can also be shared among different hosts. As you can see, "variable support" and "cross file parsing" are line based, so they give fine-grained (or say: high level) control.

But syntax design alone is not enough for reusing configuration among different hosts, because:

  1. Current applications do not use that format. Maybe it is a good idea, but we would still have to wait many years for the excellent existing applications to adopt it, and maybe they never will. What reuse can we get during the transition period?

  2. Many features of configuration syntax design are line based and mainly oriented to fine-grained reuse. If many files or many large blocks of text should be reused, it works badly (the relationships turn into spaghetti when you always use fine-grained control). The only feature of "configuration syntax design" that reuses blocks of text is "include/import", which some of today's configuration formats also have. I will explain it later in the discussion of the configuration sharing mechanism.

  3. We want to reuse configuration files among hosts, and we want more control over modifications of the files (what was modified, by whom, and when), which can be achieved with revision control. Configuration syntax design can't make use of revision control, but the Configuration Sharing Mechanism can.

So the Configuration Sharing Mechanism is about file sharing, or say, "based on file sharing".

This configuration sharing mechanism is not easy to explain, but let's try:

A very simple and direct idea is to build a network file system, such as NFS/Samba, put the configuration files on it and share them with the hosts. But there are several problems:

  1. The effect of a single point of failure spreads immediately.

  2. Most of the time the configuration files are similar, but there may still be subtle differences. How can these differences be managed effectively while the common parts remain reusable?

    For example,

  3. The budget is limited and costs must be reduced, so you may have to put several services on one machine, or say, put several logical hosts on one physical host, to make the most of the computing resources. This is especially true when adding a host means renting another box in the IDC room.

    The logical hosts on different physical hosts differ but are similar; how can you manage this efficiently? And as the world changes, you may have to split one physical host into two or three; how can that be done efficiently?

    For example, you may want to put Monitor/LogFilter/Kerberos/openLDAP on one host. If the configuration files of a monitor host already exist, the best way is to apply them all automatically, but how? Simple copying leads to duplication, because you will find:

Yes, if a cluster is used, reusability among different hosts can be improved remarkably. With LVS and GFS, for example, the configuration files can be put into the shared storage (GFS), and all the real servers are identical and read the shared configuration files from that one place.

But the real world is always more complex, and a cluster alone can't eliminate the problems mentioned above, especially the second and the third. Furthermore, this kind of sharing only covers service configuration files, not whole hosts, and there are some other, less important issues to consider:

Orthogonality Design

Now let me explain the principles of this configuration sharing mechanism. Of course there should be a shared storage, and we also need revision control for the configuration files, for which we can choose subversion. The big picture looks like this:

  ** this graph should be shown with monospaced fonts **
  
      +-------------+     
     / a real host /
    +-------------+
           A
           |                    +--------------+
           |                   / shared files /
       implement              +--------------+
           |         unfold __/             \__
           |             __/                   \__ commit
     +-----|------+   __/                         \__     +------------+
    / real files /<--/                               \-->/ subversion /
   +------------+                                       +------------+

Let me explain the structure of the shared files first. I think it is necessary to treat every service as a logical host:

  sh$ ls templates/*
  templates/base
  templates/httpd
  templates/vsftpd
  ...
  templates/webserver1 -> LAMP
  templates/ftpserver1 -> vsftpd
  ...
  templates/LAMP
  templates/docs
  templates/monitor
  templates/logfilter
  ...
  
  sh$ ls templates/httpd/*
  templates/httpd/.inherits
  templates/httpd/etc/ld.so.conf -> ../../base/etc/ld.so.conf
  templates/httpd/etc/skel -> ../../base/etc/skel
  templates/httpd/etc/rc.d -> ../../base/etc/rc.d
  templates/httpd/etc/snmp -> ../../base/etc/snmp
  templates/httpd/etc/logrotate.d/httpd
  templates/httpd/etc/hosts.deny
  templates/httpd/etc/httpd
  ...
  
  sh$ ls templates/LAMP/*
  templates/LAMP/.inherits
  templates/LAMP/etc/ld.so.conf -> ../../httpd/etc/ld.so.conf
  templates/LAMP/etc/skel -> ../../httpd/etc/skel
  templates/LAMP/etc/rc.d -> ../../httpd/etc/rc.d
  templates/LAMP/etc/snmp -> ../../httpd/etc/snmp
  templates/LAMP/etc/httpd -> ../../httpd/etc/httpd
  templates/LAMP/etc/vsftpd -> ../../vsftpd/etc/vsftpd
  templates/LAMP/etc/my.cnf -> ../../MySQL/etc/my.cnf
  templates/LAMP/usr/local/lib/php.ini -> ../../../../httpd/usr/local/lib/php.ini
  ...
  
  sh$ ls templates/docs/*
  templates/docs/.inherits
  templates/docs/etc/ld.so.conf -> ../../LAMP/etc/ld.so.conf
  templates/docs/etc/skel -> ../../LAMP/etc/rc.d
  templates/docs/etc/snmp -> ../../LAMP/etc/snmp
  templates/docs/etc/httpd
  templates/docs/etc/httpd/conf -> ../../../LAMP/etc/httpd/conf
  templates/docs/etc/httpd/conf.d/sites.conf
  templates/docs/etc/httpd/conf.d/svn.conf
  templates/docs/etc/vsftpd -> ../../LAMP/etc/vsftpd
  templates/docs/etc/my.cnf -> ../../LAMP/etc/my.cnf
  templates/docs/usr/local/lib/php.ini -> ../../../../LAMP/usr/local/lib/php.ini
  templates/docs/usr/share/vim/vim70/syntax/txt2tags.vim
  templates/docs/usr/share/vim/vim70/filetype.vim
  ...
  
  sh$ ls templates/monitor/*
  templates/monitor/.inherits
  templates/monitor/etc/ld.so.conf -> ../../LAMP/etc/ld.so.conf
  templates/monitor/etc/skel -> ../../LAMP/etc/rc.d
  templates/monitor/etc/snmp -> ../../LAMP/etc/snmp
  templates/monitor/etc/httpd
  templates/monitor/etc/httpd/conf -> ../../../LAMP/etc/httpd/conf
  templates/monitor/etc/httpd/conf.d/virtuals.conf
  templates/monitor/etc/vsftpd -> ../../LAMP/etc/vsftpd
  templates/monitor/etc/my.cnf -> ../../LAMP/etc/my.cnf
  templates/monitor/usr/local/lib/php.ini -> ../../../../LAMP/usr/local/lib/php.ini
  templates/monitor/etc/rrdtool
  templates/monitor/var/htdocs/cacti
  templates/monitor/etc/postfix
  ...
  
  sh$ ls hosts/*
  hosts/www1 -> ../../webserver1
  hosts/www2
  
  sh$ ls hosts/www2/*
  hosts/www2/.inherits
  hosts/www2/etc/ld.so.conf -> ../../../../templates/docs/etc/ld.so.conf
  hosts/www2/etc/skel -> ../../../../templates/docs/etc/rc.d
  hosts/www2/etc/snmp -> ../../../../templates/docs/etc/snmp
  hosts/www2/etc/httpd
  hosts/www2/etc/httpd/conf -> ../../../../../templates/docs/etc/httpd/conf
  hosts/www2/etc/httpd/conf.d
  hosts/www2/etc/httpd/conf.d/sites.conf -> ../../../../../../templates/docs/etc/httpd/conf.d/sites.conf
  hosts/www2/etc/httpd/conf.d/svn.conf -> ../../../../../../templates/docs/etc/httpd/conf.d/svn.conf
  hosts/www2/etc/httpd/conf.d/virtuals.conf -> ../../../../../../templates/monitor/etc/httpd/conf.d/virtuals.conf
  hosts/www2/etc/vsftpd -> ../../../../templates/docs/etc/vsftpd
  hosts/www2/etc/my.cnf -> ../../../../templates/docs/etc/my.cnf
  hosts/www2/usr/local/lib/php.ini -> ../../../../../../templates/docs/usr/local/lib/php.ini
  hosts/www2/etc/rrdtool -> ../../../../templates/monitor/etc/rrdtool
  hosts/www2/var/htdocs/cacti -> ../../../../../templates/monitor/var/htdocs/cacti
  hosts/www2/etc/postfix -> ../../../../templates/monitor/etc/postfix
  ...
  # note that it also links to 'monitor', not only 'docs'

As you can see, most of them are symlinks rather than simply shared files. In this way we still get all the benefits of simple sharing, and can exert more precise control.

All of these configuration files can be committed to subversion, which gives us revision control.

The unfold process in the previous graph copies out the symlinks as real files. Its core concept is very simple, for example:

  sh$ ct_unfold hosts/www2
  # will actually wrap this command:
  #	/bin/cp -Lrf hosts/www2/* /mnt/realfiles/www2/
  
  # and:
  sh$ mount
  ...
  //www2/homes on /mnt/realfiles/www2 type cifs (rw,mand)
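
The same copy-and-dereference step can be sketched in Python. This is only an illustration of the concept, not the ct_unfold tool: unfold is a hypothetical function, the directory names below are created in a scratch directory, and the target directory must not exist yet.

```python
import os
import shutil
import tempfile


def unfold(source_dir, target_dir):
    """Copy source_dir to target_dir, replacing every symlink by the file
    or directory it points to (the effect of `cp -Lr`)."""
    # symlinks=False makes copytree follow links and copy their targets.
    shutil.copytree(source_dir, target_dir, symlinks=False)


# Self-contained demonstration in a scratch directory.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "templates", "docs", "etc"))
os.makedirs(os.path.join(root, "hosts", "www2", "etc"))
with open(os.path.join(root, "templates", "docs", "etc", "my.cnf"), "w") as handle:
    handle.write("[mysqld]\n")
# hosts/www2/etc/my.cnf -> ../../../templates/docs/etc/my.cnf
os.symlink(os.path.join("..", "..", "..", "templates", "docs", "etc", "my.cnf"),
           os.path.join(root, "hosts", "www2", "etc", "my.cnf"))

unfold(os.path.join(root, "hosts", "www2"), os.path.join(root, "realfiles", "www2"))
unfolded = os.path.join(root, "realfiles", "www2", "etc", "my.cnf")
```

After the call, the file under realfiles/www2 is a regular file with the template's contents, so the real host no longer depends on the shared storage.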

I think it is better to let the unfold process also perform the actual file transport between the configuration center and the real host, by either pushing or pulling. The previous example uses the push method.

The implement process then copies the real files to the right local locations on the corresponding real host.

According to the previous graph and the unfold principles, we can say the design is totally orthogonal: you can modify and commit the configuration files without regard for the effect on the real host; the modifications take effect only when you perform the 'unfold'; and you know who made each modification. If subversion or the shared storage is down, the real host can still run normally.

Of course, if this were all that is offered, there would be no manageability. Maintaining all the symlinks manually is nothing less than a nightmare, so we need some utilities. Let me explain:

First, we have a base system template, templates/base, and we need a tool to 'inherit' from it:

  sh$ ct_inherit -t templates/base templates/httpd

It should create the right symlink.

As you can see, OO ideas have been applied here. We can take all the subdirs of 'templates' as class definitions (super class templates or sub class templates), and all the subdirs of 'hosts' as instances.

Then we need a tool to 'custom' some settings of 'httpd', which makes it an httpd server rather than a base system:

  sh$ ct_custom vi templates/httpd/etc/logrotate.d/httpd

It will remove the symlink 'templates/httpd' -> 'templates/base' that was created before and create a real dir 'templates/httpd'; create symlinks 'templates/httpd/*' -> 'templates/base/*' except for the 'templates/httpd/etc' subdir; then 'mkdir templates/httpd/etc' and 'ln -s templates/base/etc/* templates/httpd/etc/' except for 'templates/httpd/etc/logrotate.d'; and so on. Finally it adds a new file 'templates/httpd/etc/logrotate.d/httpd' and opens it for editing.

It should also write a file 'templates/httpd/.inherits' recording that the template inherits from 'templates/base'.

If the target file exists, it should report that and ask whether to continue:

  sh$ ct_custom vi templates/httpd/etc/hosts.deny
  This target inherits from 'templates/base'
  	do you want to cut off the relationship,
  	or track upwards?(no/yes/track) 

If 'yes', it should replace the symlink with a real copy of the file and open it for editing.

If 'track' is chosen, it will follow the symlink to its super class template, and do this recursively (so it asks again whether to cut off or track upwards if the target is also a symlink).
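
The two answers can be sketched as two small functions. Both names are hypothetical, the prompt and the editor are left out, and only regular files are handled; the paths below are created in a scratch directory for the demonstration.

```python
import os
import shutil
import tempfile


def cut_off(path):
    """'yes': replace the symlink at path by a private copy of its target."""
    target = os.path.realpath(path)
    os.remove(path)
    shutil.copy2(target, path)


def track_upwards(path):
    """'track': follow one level of symlink and return the path of the
    file in the super class template."""
    link = os.readlink(path)
    return os.path.normpath(os.path.join(os.path.dirname(path), link))


# Demonstration: httpd/hosts.deny -> ../base/hosts.deny
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "base"))
os.makedirs(os.path.join(root, "httpd"))
parent_file = os.path.join(root, "base", "hosts.deny")
with open(parent_file, "w") as handle:
    handle.write("ALL: ALL\n")
child_file = os.path.join(root, "httpd", "hosts.deny")
os.symlink(os.path.join("..", "base", "hosts.deny"), child_file)

tracked = track_upwards(child_file)  # the file that 'track' would edit
cut_off(child_file)                  # httpd now owns an independent copy
```

A recursive 'track' would simply call track_upwards again while the returned path is itself a symlink.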

Of course there should be a '--force' option to skip the prompt:

  sh$ ct_custom --force vi templates/httpd/etc/hosts.deny

or

  sh$ ct_custom --track-end=base vi templates/httpd/etc/hosts.deny

Some other operations should also be possible, such as:

  sh$ ct_custom cp -r /tmp/httpd/* templates/httpd/etc/httpd/

The last argument should always be the target!

In the previous structure examples, the host instance 'www2' has symlinks not only to 'templates/docs' but also to 'templates/monitor'. Thus we can say it multiply inherits from 'templates/docs' and 'templates/monitor':

  sh$ ct_inherit -t templates/docs -t templates/monitor hosts/www2

In this case, 'hosts/www2' should look for the files to symlink in 'templates/docs' first, and then in 'templates/monitor'. Overlapping files should be recorded in 'hosts/www2/.inherits' and the first template wins, unless there is a file conflict: if the 'same file' has different contents in the two templates, the tool should prompt the user to solve the problem manually.
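
The resolution rule (first template wins, differing contents are a conflict) can be sketched as a planning step that only decides what to link and creates nothing. plan_inherit and list_files are hypothetical names, directory symlinks inside a template are not descended into, and the two templates are built in a scratch directory.

```python
import filecmp
import os
import tempfile


def list_files(template_dir):
    """Return the files below template_dir as paths relative to it."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(template_dir):
        for filename in filenames:
            if filename != ".inherits":
                full = os.path.join(dirpath, filename)
                found.add(os.path.relpath(full, template_dir))
    return found


def plan_inherit(templates):
    """Return (plan, overlaps, conflicts). plan maps each relative path to
    the template providing it; the first template listed wins."""
    plan, overlaps, conflicts = {}, [], []
    for template in templates:
        for relpath in sorted(list_files(template)):
            if relpath not in plan:
                plan[relpath] = template
                continue
            overlaps.append(relpath)
            first = os.path.join(plan[relpath], relpath)
            other = os.path.join(template, relpath)
            if not filecmp.cmp(first, other, shallow=False):
                conflicts.append(relpath)  # needs a manual decision
    return plan, overlaps, conflicts


# Two throwaway templates with invented contents.
root = tempfile.mkdtemp()
docs, monitor = os.path.join(root, "docs"), os.path.join(root, "monitor")
for template, files in ((docs, {"my.cnf": "same", "httpd.conf": "docs version"}),
                        (monitor, {"my.cnf": "same", "httpd.conf": "monitor version",
                                   "rrdtool.conf": "only here"})):
    os.makedirs(os.path.join(template, "etc"))
    for name, text in files.items():
        with open(os.path.join(template, "etc", name), "w") as handle:
            handle.write(text)

plan, overlaps, conflicts = plan_inherit([docs, monitor])
```

The overlaps list is what would be written to '.inherits', and a non-empty conflicts list is the point at which the tool would stop and ask the user.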

In this way 2 templates can be merged into 1. You may also want to merge 2 into 1 without inheriting (by copying the contents); maybe I will add a program ct_merge to do so, and let ct_inherit make use of it.

If a super class template is modified, the administrator who makes the modification should be notified of all the sub class templates and instances that inherit from it:

  sh$ ct_custom vi templates/LAMP/etc/httpd/conf/httpd.conf
  ...
  These templates and hosts will be impacted by this modification:
  	1. templates/docs
  	2. templates/monitor
  	3. hosts/www2
  do you want to continue?(yes/no/exclude) exclude 3(exclude hosts/www2)

Of course there should be a '--force' option.
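
Finding the impacted templates is a walk down the inheritance graph. The sketch below assumes the contents of all '.inherits' files have already been read into a dict that maps each template or host to its parents; that representation and the function name impacted_by are assumptions.

```python
def impacted_by(modified, inherits):
    """Return every template or host that inherits, directly or
    indirectly, from `modified`, nearest descendants first."""
    impacted = []
    pending = [modified]
    while pending:
        current = pending.pop(0)
        for name in sorted(inherits):
            if current in inherits[name] and name not in impacted:
                impacted.append(name)
                pending.append(name)
    return impacted


# The inheritance relations of the example structure above.
INHERITS = {
    "templates/httpd": ["templates/base"],
    "templates/LAMP": ["templates/httpd"],
    "templates/docs": ["templates/LAMP"],
    "templates/monitor": ["templates/LAMP"],
    "hosts/www2": ["templates/docs", "templates/monitor"],
}

affected = impacted_by("templates/LAMP", INHERITS)
```

For 'templates/LAMP' this yields templates/docs, templates/monitor and hosts/www2, the same three entries as in the prompt above.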

inheritance tree?

Scripts as Configuration

All Configuration Issues

Templates of Configuration

Package Management

Currently there is only a simple README file in this subproject to describe it. You can also find the hint among the LFS hints:

[http://www.linuxfromscratch.org/hints/downloads/files/crablfs.txt]

The basic principles can also be found in the LFS hints:

[http://www.linuxfromscratch.org/hints/downloads/files/more_control_and_pkg_man.txt]

A package manager has many functions; here I only talk about how to add reusability to it and how to make use of that reusability. It will also use the configuration syntax features to reuse installation profiles, and the configuration sharing mechanism to reuse installation templates.

User Based Package Management System

  1. Maximize the usage of basic OS characteristics
  2. More reusability
  3. More security
  4. ...

Current problems:

  1. Packages' Dependencies
  2. ...

Build LFS/BLFS Automatically by UPM

Templates of Installation

Backup, Recovery, and Realtime File System Mirror for High Availability

Overview of Backup and Recovery


Backup is always considered one of the most important tasks in system administration, and there seem to be many solutions, both commercial and open source.

Because of my limited experience, I have not used any commercial backup system so far. All I know is that they are not cheap; and from my experience with some commercial software, I don't think they would fit my needs, and I guess they may bind you to specific hardware.

Let's look at file system backup. Database backup is different and is not my concern here:

Maybe you will write some scripts that tar the dirs/files from cron, but the archives soon become unmanageable, because it is not easy to tell what has been added to each archive, and there may be many archives for different parts of the file system.

The lack of incremental/differential backup is another weakness of "simple tar". And because tar can never find out what has been deleted, a recovery will be littered with outdated scraps. You can use find to solve this problem in an inefficient way, BUT:

Most importantly, you will also be stuck with a real production server's file system holding millions of dirs/files; extracting from the archives will be very, very slow ... In fact, just traversing this type of file system once can take too much time. The efficiency and convenience of recovery is simply not taken into consideration.

So with today's fast, large and cheap disks, synchronization mechanisms such as rsync, amanda ... are applied for backup and recovery. And with the hard link ability of Linux, you can get a complete full backup at every sync point while actually storing only the incremental modifications (or say snapshot-like incremental backup, or say rotation ...).
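
The hard link trick can be sketched as follows: every snapshot directory looks like a full copy, but files that did not change since the previous snapshot are hard links to it and use no extra data blocks. This is an illustration of the principle only, not how pdumpfs or rsync are implemented; snapshot and unchanged are hypothetical names, and the change test (same size and same mtime in seconds) is a simplification.

```python
import os
import shutil
import tempfile


def unchanged(source_file, previous_file):
    """Quick check: same size and same modification time (whole seconds)."""
    if not os.path.isfile(previous_file):
        return False
    src, prev = os.stat(source_file), os.stat(previous_file)
    return src.st_size == prev.st_size and int(src.st_mtime) == int(prev.st_mtime)


def snapshot(source, previous, current):
    """Create `current` (must not exist) as a full image of `source`,
    hard-linking unchanged files to the `previous` snapshot (or None)."""
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        os.makedirs(os.path.normpath(os.path.join(current, rel)))
        for filename in filenames:
            src = os.path.join(dirpath, filename)
            dst = os.path.normpath(os.path.join(current, rel, filename))
            old = os.path.normpath(os.path.join(previous, rel, filename)) if previous else None
            if old and unchanged(src, old):
                os.link(old, dst)       # shares the data of the old snapshot
            else:
                shutil.copy2(src, dst)  # copy2 keeps the mtime for the next run


# Demonstration in a scratch directory.
root = tempfile.mkdtemp()
source = os.path.join(root, "data")
os.makedirs(source)
for name, text in (("a.txt", "stable"), ("b.txt", "old")):
    with open(os.path.join(source, name), "w") as handle:
        handle.write(text)
snap1, snap2 = os.path.join(root, "snap.1"), os.path.join(root, "snap.2")
snapshot(source, None, snap1)
with open(os.path.join(source, "b.txt"), "w") as handle:
    handle.write("new contents")
snapshot(source, snap1, snap2)
```

After the second run, a.txt is one file with two names, while b.txt exists in both its old and its new version.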

There is a Ruby application named pdumpfs that makes use of both rsync and rotation (pdumpfs-rsync); I used it under FreeBSD for a while.

With such tools you can recover from a crash quickly, with data that may be at least some hours old. For example, if you back up a host like this:

  This graph should be displayed with monospaced fonts:
  
      +----------+                +----------+
      |  worker  | --- rsync ---> |  rescue  |
      +----------+                +----------+

You can replace the "worker" host with the "rescue" host quickly: just make several symlinks on "rescue" that point to the right directories.

A more likely topology is the following, because it is more cost-efficient; the previous one requires a hot standby for every host ("couple hot backup") and is thus more expensive:

  This graph should be displayed with monospaced fonts:
  
      +----------+ 
      |  worker  | --- rsync -----------\
      +----------+                      |
         ......                         |
                                        |
      +----------+                      |
      |  worker  | --- rsync -----------\
      +----------+                      |
                                        V
      +----------+                +----------+
      |  worker  | --- rsync ---> |  backup  |
      +----------+                +----------+
           |                            |
      [take_over]                       |
           |                            |
           V                            |
      +----------+                      |
      |  rescue  | <------------------ NFS
      +----------+

In this way you get a buffer, and thus more time to recover the crashed worker host. This model is clearly low cost, since you can back up and rescue (HA) more than 3 hosts with only 2 hosts (many to one).

NFS is used because recovering directly by copying the files back takes too much time to be practical. For example, on a real production online system with SCSI disks, a 100 Mbit NIC and wiring, and 50 GB of data in nearly 2 million dirs/files (100 thousand dirs), copying the files back takes more than 2 hours at the very least (you can set up SNMP/Cacti to get the statistics and compute the time).

Although the SCSI interface reports "target0:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 62)" in dmesg, that figure is only the peak transfer rate of the bus. About 3 MBytes/s is the realistic upper limit for the sum of disk reads/writes when retrieving so many small dirs/files, and about 30 Mbits/s for the network traffic, so disk I/O is the bottleneck.

If you transport a very big file of several GBytes, the result is very different: more than 11 MBytes/s for disk I/O and 98 Mbits/s for network traffic. The network becomes the bottleneck, and the performance of both disk and network is fully used.

We must focus on the former situation, because that is the real condition we have to face, and compute the time that will be consumed:

  sh$ echo "50 * 1024 / 3 / 60 / 60" | bc -l
  4.74074074074074074074

That means recovering from the backup takes nearly 5 hours! Even assuming disk I/O of 6 MBytes/s:

  sh$ echo "50 * 1024 / 6 / 60 / 60" | bc -l
  2.37037037037037037037

more than 2 hours are needed! On the whole this is unacceptable!
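
The same estimate as a small function, so that other data sizes and throughputs can be tried. The function name is hypothetical; the figures are the ones quoted in the text (50 GB of data, 3 or 6 MBytes/s for many small files, about 11 MBytes/s for large files).

```python
def recovery_hours(data_gbytes, throughput_mbytes_per_s):
    """Hours needed to copy data_gbytes at a sustained throughput."""
    seconds = data_gbytes * 1024.0 / throughput_mbytes_per_s
    return seconds / 3600.0


small_files_slow = recovery_hours(50, 3)   # about 4.74 hours
small_files_fast = recovery_hours(50, 6)   # about 2.37 hours
large_files = recovery_hours(50, 11)       # about 1.29 hours
```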

So the key is: ELIMINATE or REDUCE copy and traverse operations during rescue as far as possible.

The difference between small and big files is rooted in the basic structure and principles of the file system itself. Take Ext3 as an example: the physical disk is split into many block groups, and the metadata of dirs/files is scattered over different parts of the disk. To transport such small files, the OS kernel has to fetch the metadata of every one of them and spends too much time on disk seeking, while big files keep their data in a few sequential areas on disk.

Maybe you can make use of RAID. I think RAID10 can at most double the speed (you must also take the writes on the target side into consideration), so more than an hour is still needed. And this is only the copy time: after copying, you may need to change the ownership and permissions of the files, which requires a traversal and is therefore time consuming, and several related operations may be needed too ...

I'm not familiar with Reiserfs, but I don't expect it to be 5 times faster than Ext3. Even with the aid of RAID I can only expect a 6-fold speedup, and with the other related operations I think at least 1 hour is needed.

But even with the rescue method mentioned above, you still lose the data written since the last rsync run, because:

The rsync mechanism is not perfect (of course nothing in this world is), because rsync is not "realtime". Every time rsync is invoked it must build a list of files to transport; to build that list it has to traverse the whole source part of the file system, compare the timestamps of every source and target file, and then do the corresponding operation. As the file system grows larger and larger, every synchronization consumes more time and more system overhead, so you have to increase the interval. But your original intention was to minimize it, since a larger file system and a longer interval mean that more files/data are lost. For a critical service this loss may not be acceptable.

rsync also has the "vanished" problem: between building the list and doing the actual transport, the file system may change. New dirs/files may be created or modified, and some dirs/files may be deleted. The former is not critical, but the latter may cause problems. For example, pdumpfs-rsync terminates when the "vanished" problem occurs and leaves the rest of the synchronization unfinished! rsync itself, I think, is robust enough that these warnings can simply be ignored.

No matter what mechanism you use (tar with find, rsync, ...), there is still a challenge: usually you only want to back up specified parts of the file system and exclude some dirs/files within those parts, such as cache files generated by the website's PHP scripts. The identities of these parts should be easy to manage, preferably by the computer itself once you have specified them the first time.

So far I have not found an open source solution for realtime file system replication/mirror backup, so I wrote this one and hope it will be useful to you. (In fact, just yesterday (18 September 2007), I heard of some realtime solutions; some are commercial, such as PeerFS, while some are open source, such as DRBD. I only have a rough idea of them so far, so I will do some experiments to learn more; maybe they are better solutions.)

I must say that this C/S application in the cutils package, which I named mirrord/fs_mirror, is not an absolutely realtime backup tool. It is only "near realtime": under normal conditions (normal load average and disk I/O, enough network bandwidth), the delay between the host that runs the mirrord agent and the host that runs the fs_mirror client is less than 1 second. The resources it consumes are negligible. For the details, jump to the later "Design" sections.

It is also clear that, so far, mirrord/fs_mirror is only suitable for file systems with small files, not for big files that change frequently, such as logfiles or databases. Other applications can be used for those.

As a summary, let's look at these aspects of backup and recovery:

  1. There are two types of backup: mirror/replication (or say hot backup), and periodical backup.

    To make recovery as fast as possible, a mirror is necessary. Another main purpose of a mirror is to deal with hardware failure (although RAID can deal with some disk failures, it can't cope with CPU/memory/power/... failures, and it is too expensive to make all of these redundant).

    But a mirror cannot deal with misoperation (there may be thousands of subdirs to back up, for thousands of users, and someday a user may tell you he/she wants to roll back to the data of several days ago :P), viruses or intrusion. If one of these occurs, you can only rely on the last periodical backup archives.

    So a backup system must contain both mirror and periodical backup functions, and luckily mirrord/fs_mirror provides both.

  2. For faster recovery, file copying and directory traversal must be limited as far as possible. The best way is to replace the crashed server with a rescue host, or to export the relevant dirs from the backup host to the rescue host to buffer the pressure. The topology is similar to the rsync one mentioned above, and the details will be discussed in later sections (if you only care about how to use mirrord/fs_mirror, just jump to the "Usage Manual" section).

  3. Manageability is also important. You may want to back up only several parts of the whole file system, and to separate these parts, such as platform dependent and platform independent ones. Within the included parts there may still be subparts you want to exclude. mirrord/fs_mirror provides these functions by importing the module fs_info.
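
The include/exclude idea of the third point can be sketched as a simple path test. This is not the fs_info module, whose interface is not shown here; is_selected and the example paths are assumptions for illustration.

```python
def is_selected(path, includes, excludes):
    """True if path lies in an included part and not in an excluded subpart."""
    def under(parent):
        parent = parent.rstrip("/")
        # Compare whole path components, so /var/wwwroot is not under /var/www.
        return path == parent or path.startswith(parent + "/")

    return any(under(p) for p in includes) and not any(under(p) for p in excludes)


# Invented example: back up /var/www and /etc, but skip generated cache files.
INCLUDES = ["/var/www", "/etc"]
EXCLUDES = ["/var/www/cache"]

page_selected = is_selected("/var/www/site/index.php", INCLUDES, EXCLUDES)
cache_selected = is_selected("/var/www/cache/page1.tmp", INCLUDES, EXCLUDES)
```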

The design concepts and cases will be discussed in the next section. If you just want to know how to use mirrord/fs_mirror, please skip to the section "Usage Manual".

Design mirrord/fs_mirror

inotify

To achieve mirroring, the application must know what has been modified in the interval (commonly less than 1 second) since the last file transport. As we know, rsync has to traverse the whole of the relevant parts of the file system and compare timestamps, which costs too many resources and is not realtime.

mirrord/fs_mirror avoids traversal and comparison by using a facility provided by recent Linux kernels (since 2.6.13): inotify.

inotify can monitor any modification in the specified directories. That is, when you create, delete, move or change attributes of dirs/files, or change the contents of regular files in those directories (not recursively), inotify raises an event. The application can catch and enqueue the events, use the watch (described below) to find the name of the file that caused each event, and do the corresponding operation. Here mirrord records the action and the related filename and tells fs_mirror by transporting the records; fs_mirror then does the corresponding operation, such as creating, deleting, moving or transporting a file.

Although inotify has no recursive ability itself, the application can recursively add the directories to its list of watches. Each watch represents one directory to be monitored by inotify and is identified by an integer descriptor, so it consumes little memory. And because the facility is provided by the kernel, it is lightweight.
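
The recursive registration can be sketched with the standard library alone. The function below walks the tree once and calls a caller-supplied add_watch for every directory, keeping the descriptor-to-directory map that is later needed to turn an event into a full path. The real call (for example the add_watch method of pyinotify's WatchManager, which also offers a rec option) is replaced here by a stand-in that just hands out integers; all names in the sketch are hypothetical.

```python
import itertools
import os
import tempfile


def add_watches_recursively(top, add_watch):
    """Call add_watch(directory) for top and every directory below it and
    return {watch_descriptor: directory}."""
    watches = {}
    for dirpath, _dirnames, _filenames in os.walk(top):
        watches[add_watch(dirpath)] = dirpath
    return watches


# Stand-in for the real inotify registration: returns 1, 2, 3, ...
_counter = itertools.count(1)


def fake_add_watch(directory):
    return next(_counter)


# Demonstration on a small scratch tree.
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "a", "b"))
os.makedirs(os.path.join(top, "c"))
watches = add_watches_recursively(top, fake_add_watch)
```

The initial walk is a one-time cost; directories created later have to be added to the watches when their creation event arrives.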

For more details about inotify, there are some introduction on Internet:

  http://www.linuxjournal.com/article/8478
  http://www-128.ibm.com/developerworks/linux/library/l-inotify.html
  http://blog.printf.net/articles/2006/03/25/lightweight-filesystem-notifications

Or you can read the inotify source files in the Linux kernel tree:

  ./include/linux/inotify.h
  ./Documentation/filesystems/inotify.txt
  ./fs/inotify.c

And there is a Python package, pyinotify, that mirrord/fs_mirror currently uses: http://pyinotify.sourceforge.net/

The First Story

The simplest story comes from one of my colleagues: make use of hard links to record the modifications of the file system, export this hard link record directory to the remote backup host via NFS, and let the daemon on the backup host periodically move the dirs/files in the exported record fs to local directories, like this:

  This graph should be displayed with monospaced fonts:
  
               (events)                              \
     [inotify] ----> [mirrord]           worker host \ rescue/backup host
         |               |(invoke)                   \
         |               |                           \
  __original-fs__ == HardLink ==> __records__ == MoveOnNFS ==> [fs_mirror]
                                                     \

Although it is simple, this model is problematic. There are several problems that can't be solved by this "only hard link" model:

  1. There is no way to know what is "deleted", thus fs_mirror can't sync "DELETE" operations, and the same goes for "MOVE" since "MOVE" implicitly contains some types of "DELETE".

    In fact, "MOVE" and "DELETE" behave differently in inotify (this will be discussed in later sections), and a simple hardlink can't distinguish these differences.

    You may suggest building a "DELETE" directory that only records the "DELETE" operations. But this makes the model more complicated, and:

  2. The order of the operations is important (since it's impossible to delete a file before it is created, or write a file before its parent directory exists, ...), while "only hard link" can't record the order of the operations!

  3. fs_mirror must move the dirs/files out of the record directory via NFS (or any other network file system) in time, otherwise the record directory will become too large, which will reduce performance.

  4. Here, moving by fs_mirror is problematic, because it is not an atomic operation. As is known, "move" consists of "copy" (in fact here it must "copy" via NFS, so it is more complicated than a simple "copy") and "remove", so it can't be completed with only one system call; even "copy" may invoke the read/write system calls several or more times. This can lead to missing records.

    Have a look at an example:
      This graph should be displayed with monospaced fonts:
      
          file1 <=== hardlink === file2
            A                       |
            |                       V
       WRITE_CLOSE                 (mv)
      
        procedure:
      
            |                       |
            |                       |<-- COPY_COMPLETED
            |                       |
            |<-- WRITE_CLOSE        |
            |                       |
            |                       |<-- REMOVE
    
    WRITE_CLOSE (inotify's IN_CLOSE_WRITE) is the event raised when a file that was opened for writing is closed; if you have read the inotify documents, this is not hard to understand. When the event is raised, the inotify processor does the corresponding "hardlink" operation from file1 to file2. Let's assume that file2 already exists because of a previous WRITE_CLOSE, so this time's WRITE_CLOSE has to ignore the "OSError: file exists" exception. So when "COPY" runs, only the content modified last time is transported, and "REMOVE" unlinks the hardlink "file2" that was "created" just a moment ago (maybe several milliseconds), without synchronizing the content modified this time, thus it is missed. If the file does not change again before the machine crashes, you totally lose this modification. Only one file is not a serious problem, but there is no guarantee that only a few files will be affected in this way. And as we know, "COPY" itself is not atomic, which will lead to more holes in the target files.

  5. There may be the need to mirror the file system to several clients, or say "multi-mirror". This simple hardlink can not achieve this, since after fs_mirror has "moved" the hardlink file, the record is eliminated.

  6. So it is more like a script solution than a production solution. Since backup is a critical application, it is better to make it production quality with strict tests, rather than several irresponsible scripts.

  7. Simple hardlinks are not flexible enough, because you must place all files on the same file system, which means you must adjust every host. In fact, it is annoying to set up NFS on every working host, and this "push" model is not secure enough -- you have to export the critical data via NFS.

  8. At last, since you have to create structures on the physical hard disk, there may be a performance penalty.

Therefore a more powerful strategy should be chosen for recording the modifications of the file system. Since mirrord currently uses pyinotify, I suggest you have a look at it.

The Second Story

It is known that an instance of a class that inherits from "pyinotify.ProcessEvent" can be used to do the operations corresponding to the events. For example, if there is a method def process_IN_CLOSE_WRITE(self, event): the inotify "IN_CLOSE_WRITE" events will be processed by this method in a loop you program, for example:

  while not self.Terminate:
  	...
  	if wmNotifier.check_events(wmTimeout):
  		wmNotifier.read_events()
  		...
  			wmNotifier.process_events()
  			...wmNotifier.process_events()

Here wmNotifier is the instance of "pyinotify.Notifier" (please refer to the pyinotify document if you are not clear). When wmNotifier.process_events() is called, it invokes the process_* methods of the instance inherited from "ProcessEvent": "IN_DELETE" events correspond to "process_IN_DELETE", and "IN_CLOSE_WRITE" events correspond to "process_IN_CLOSE_WRITE" as mentioned above.

So what has to be done is to code such a class and its process_* methods; we can name the inherited class "a Processor".

Not all events raised by inotify are important. For file system mirroring, just several events should be taken into consideration:

  IN_CREATE (with IN_ISDIR, for directories only)
  IN_CLOSE_WRITE (only for regular files)
  IN_DELETE
  IN_MOVED_FROM
  IN_MOVED_TO
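To make the process_* naming convention concrete, here is a minimal stand-alone sketch (Python 3). It deliberately does not import pyinotify: FakeEvent and dispatch() are hypothetical stand-ins for pyinotify's Event objects and for what Notifier.process_events() does, and the record labels CREATE/FWRITE/DELETE are only illustrative. It is not the actual mirrord code.

```python
import posixpath


class FakeEvent:
    """Hypothetical stand-in for a pyinotify event (assumption, not the real class)."""

    def __init__(self, maskname, path, name, is_dir=False):
        self.maskname = maskname  # e.g. "IN_CREATE"
        self.path = path          # the watched directory
        self.name = name          # the entry inside that directory
        self.dir = is_dir         # True when the entry is a directory


class Monitor:
    """Collects (action, pathname) records from its process_* handlers."""

    def __init__(self):
        self.records = []

    def _full(self, event):
        return posixpath.join(event.path, event.name)

    def process_IN_CREATE(self, event):
        if event.dir:  # only directory creation is recorded
            self.records.append(("CREATE", self._full(event)))

    def process_IN_CLOSE_WRITE(self, event):
        self.records.append(("FWRITE", self._full(event)))

    def process_IN_DELETE(self, event):
        self.records.append(("DELETE", self._full(event)))


def dispatch(processor, event):
    """Call processor.process_<maskname>(event) if such a method exists."""
    handler = getattr(processor, "process_" + event.maskname, None)
    if handler is not None:
        handler(event)


monitor = Monitor()
for ev in [
    FakeEvent("IN_CREATE", "/var/www/html", "include", is_dir=True),
    FakeEvent("IN_CLOSE_WRITE", "/var/www/html", "index.php"),
    FakeEvent("IN_ACCESS", "/var/www/html", "index.php"),  # no handler: ignored
    FakeEvent("IN_DELETE", "/var/www", "html.old"),
]:
    dispatch(monitor, ev)

for record in monitor.records:
    print(record)
```

Events without a matching process_* method are simply dropped, which is how the uninteresting inotify events stay out of the records.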

There is another event, "IN_ATTRIB", that is also very useful, but it can only be supported in the next version because of my immature first design of mirrord.

What should this Processor do, or say what should the process_* methods be? Let's first turn our eyes to another problem: what should be watched.

If you have read the documents of inotify and pyinotify, you know that only the dirs/files that are watched by inotify will raise events, and it is your program's responsibility to add the dirs/files to the watches. So you must make clear what is to be added to the inotify watches by calling inotify_add_watch() of inotify or WatchManager.add_watch() of pyinotify.

Making clear what should be added to the watches, or say, monitored by inotify, is the manageability problem mentioned before. The first rule we should know is: only the directories need to be monitored. Because any modification of the subdirs/subfiles in those directories makes inotify raise events, and you can get the monitored dir's path name and the subdir/subfile's name from the pyinotify "Event" instance related to the event. In fact mirrord only adds directories to the pyinotify watches.

There is the recursion problem mentioned above: when you add a directory to the inotify watches, only modifications of the first level subdirs/subfiles make inotify raise events. For example, if "/var/www/html" is monitored, modifications of "/var/www/html/index.php" or "/var/www/html/include/" can be caught, while those of "/var/www/html/include/config.php" or "/var/www/html/include/etc/" can not.

This means if you create directories by mkdir -p /var/www/html/include/etc and "/var/www/html" is monitored, only "/var/www/html/include" will make inotify raise the "IN_CREATE|IN_ISDIR" event. The same rule holds for os.makedirs in Python, and for any recursive copy/move operations!

So it is the responsibility of the program (mirrord) to add all the necessary directories to the watches. This leads to a traversal of some parts of the file system, which is resource consuming, but this traversal only needs to be done once; after this stage, named "boot init", the resources can be released and mirrord runs as a daemon.

Although pyinotify.WatchManager has a "rec" argument for its add_watch(), I don't plan to use it, because:

  1. It does not have the function to exclude some subdirs, for example, the "cache" directories, or directories with big files that vary frequently.

  2. This "rec" is static, which means that when the dir/file entities have been deleted or moved out, the inotify watches will not be adjusted correspondingly. In fact this mainly occurs for "IN_MOVED_FROM"; you can have a look at a prototype program "prototypes/mirrord_inotify_delete.py" in the svn source "trunk" directory of ulfs/cutils.

    It is necessary to avoid leaving those watch "corpses" in the system, which may make mirrord too bloated, and because the current pyinotify only uses RAM to store the watches, they waste memory.

Here, a module fs_info.py is used for including/excluding. It is a file system identity system: you can use it to split your file system into several parts with an include/exclude configuration, and use the methods it provides to find a list of all the dirs/files that fit the config.

For example, you can add some dirs/files to a file system identity by this way:

  sh# fs_info -a t:/var/named \
  -a t:/var/www/html \
  -a x:/var/www/html/cache \
  -a x:/var/www/html/log website

Here fs_info is a command line interface which invokes the corresponding methods of the module fs_info.py depending on the options. "-a" means "append", and t/x have a similar meaning to "-T/-X" of tar, or say, "include/exclude".

The identity is "website". This means a file system identity "website" contains several directories and their subdirs/subfiles. When you invoke mirrord, you can pass this identity as an argument to the Mirrord constructor of the mirrord.py module, then the instance can use FsInfo.find() to get a whole list of the subdirs/subfiles of the dirs contained in this identity, without the excluded ones.

fs_info.FsInfo will build a directory "/var/fs_info/website", and record the included(t) dirs/files in /var/fs_info/website/.t_files and the excluded(x) dirs/files in /var/fs_info/website/.x_files. You can adjust the location by modifying the configuration of fs_info: options.datadir = "/var/mirrord" in /etc/python/fs_info_config.py.
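The include/exclude idea can be sketched in a few lines of Python 3. This is only an illustration of the technique, not the code of fs_info.FsInfo.find(): the function name find_dirs() and its arguments are assumptions made for this example, and it only yields directories (the things mirrord needs to watch).

```python
import os
import tempfile


def find_dirs(includes, excludes):
    """Yield every directory under the included roots, pruning excluded subtrees."""
    excluded = {os.path.normpath(p) for p in excludes}
    for root in includes:
        root = os.path.normpath(root)
        if root in excluded:
            continue
        for dirpath, dirnames, _filenames in os.walk(root):
            # Pruning dirnames in place stops os.walk from descending
            # into the excluded subtrees at all.
            dirnames[:] = [
                d for d in dirnames
                if os.path.join(dirpath, d) not in excluded
            ]
            yield dirpath


# Demo on a throw-away tree that mimics the "website" identity above.
with tempfile.TemporaryDirectory() as base:
    html = os.path.join(base, "html")
    for sub in ("include/etc", "cache/tmp", "log"):
        os.makedirs(os.path.join(html, sub))
    watched = sorted(
        os.path.relpath(d, base)
        for d in find_dirs(
            [html],
            [os.path.join(html, "cache"), os.path.join(html, "log")],
        )
    )

for d in watched:
    print(d)
```

Because the excluded subtrees are pruned during the walk, a large "cache" directory costs nothing in the boot init traversal.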

So at last we get this view:

  This graph should be displayed with monospaced fonts:
  
               (1)---> [mirrord] ---(2)---> [fs_mirror] ---(3)
               /                                             \
          [fs_info]    /==============(4)===============> [fs_sync]
             /        /                                        \
      +--------------+                                  +---------------+
      | original     |                                  | mirrored      |
      |  file system |                                  |   file system |
      +--------------+                                  +---------------+
  
  (1) tell it what to monitor for mirroring
  (2) tell it what has been modified
  (3) tell it what to transport
  (4) the actual regular files being transported

By (2), fs_mirror knows what has been changed, and thus calls fs_sync to do the corresponding operations: directory creation and dir/file deletion and moving can be performed on the client side itself, so they consume no resources of the server; but the content modifications of regular files can't be mirrored unless the files are copied from the server. This copy can be performed by any network file transport protocol or tool, such as FTP, NFS, SMB/CIFS, SSH or rsync ..., so there can be many implementations of fs_sync.

Some further instructions about this sync mechanism can be found at the end of The Fifth Story: Monitor.

Having explained "fs_info" and (1), let's come back to the previous question: what should the "Processor" (to avoid confusion, let's call it "Monitor") do? This is a part of the function of "mirrord" and (2) in the view above. For more details about fs_info, please read the code and internal document.

From the previous picture and the first story, we know that mirrord has to:

  1. Collect path names for the inotify watches. The file names can be collected by fs_info.FsInfo.find() in the "boot init" stage of mirrord.Mirrord.

  2. Manage the file system snapshot. It is another useful feature: a collection of all the path names of the current file system (with several pieces of status info related to them). There are several reasons why a snapshot should be used, which will be shown later.

  3. Create a Monitor instance to do the corresponding process_* actions when the related events are raised by inotify, including adjusting the watches when delete/move or recursive copy/move events occur, and recording the modifications of the file system.

  4. Transmit the record of modifications to fs_mirror (2), but only do the transmission when fs_mirror asks for it, because this pull model is more secure and flexible (with a push model, the other work of mirrord would be affected when it can't push, otherwise mirrord has to be more complicated). So mirrord should be a server agent, and fs_mirror is the client.

    It is suitable to transmit this record via socket, which leads to designing a rational protocol between the agent and the client.

  5. Serve several clients for "multi-mirror", that is, several clients can synchronize the file system concurrently, and maybe furthermore (in future versions) synchronize different parts of the file system concurrently.

  6. Schedule several threads or processes that do different things. At the least there are 2 threads: a main thread and a schedule thread.

    The main thread runs an infinite loop that uses the inotify "Monitor" to check for, read and process the inotify events. The blocking period is several seconds (4s by default) if nothing happens, otherwise it wakes up immediately.

    Since mirrord has to serve the fs_mirror client via socket, there should be at least one thread listening on a port, waiting for connections and requests. The main thread is occupied by the Monitor processing the inotify events, so a schedule thread is necessary (another solution is select/poll). It is called the "schedule thread" because of "multi-mirror": once a host running fs_mirror asks for synchronization, a new "server thread" should be spawned to serve that client.

    Server threads mainly deal with the protocol related things.

  7. Share some variables, such as the modification record and the file system snapshot, between the threads, mainly the main thread and the server threads.

The Third Story: Record Log

In the first story, we have seen one type of mechanism for modification recording, which has proved not to be a good idea.

After that, the first thing that came into my head was to set up four arrays: CREATE (IN_CREATE|IN_ISDIR, only for directories), FWRITE (only for regular file writing), DELETE (IN_DELETE) and MOVE (IN_MOVED_FROM and IN_MOVED_TO), and to append any path name related to a modification to the corresponding array.

My first design was to manage such arrays for every server thread, that is, every time a server thread is created, it builds these arrays for itself, because the status of every client varies.

To manage these groups of per-thread arrays, every thread is forced to implement an inotify "Processor", or say, a "Counter", to append modifications to the arrays, and to clear the arrays once the client has read the records.

This design is also problematical:

If we take MySQL replication as a reference, the right solution appears. What should be done is to set up a log to record all types of modifications with the corresponding path names. Let's name it "wmLog" (Watch Modify Log).

This wmLog should be shared by the main thread and the server threads: the "Monitor" in the main thread appends the new modifications to the tail of wmLog, and each server thread reads the wmLog with a pointer that indicates its read position, thus a server thread knows what it has processed.

What should this wmLog look like? Since the order is very important, a list (sequence) is fairly direct, but a hash table is more flexible and clear. If you need to modify the wmLog, for example, to delete the overdue records older than about 1 month so that the wmLog does not grow too large, a hash table is fairly useful; and with hash-table-like APIs, it's easier to switch to other logging mechanisms (for example, Berkeley DB, discussed below).

To make the hashed wmLog ordered, its key is a serial number. Every time an inotify event is raised and processed, the serial number is increased by 1, so the wmLog looks like this dictionary in Python:

  {
      1 : ('CREATE', '/var/www/html'),
      2 : ('CREATE', '/var/www/html/include'),
      3 : ('FWRITE', '/var/www/html/index.php'),
      4 : ('DELETE', '/var/www/html.old'),
      5 : ('MOVE', ('/var/www/html', '/var/www/html.new')),
      ...
  }

So the pointer owned by every server thread is just an integer, which can differ from the main thread's serial number and from the other pointers, indicating the different reading progress of different clients.

This mechanism makes it possible to restart client synchronization from the broken point. For example, when the main thread has increased the serial number to 10234, a server thread may have read only up to position 9867; then the client may be terminated for some reason while the main thread's serial number goes on increasing. The next time the client restarts, it can tell the server thread to transmit wmLog from the point 9867. Otherwise the client has to ask for a whole new sync init, which forces the server thread to resend the whole content of the file system snapshot. This resending is resource consuming (since you can always use Python generators, it is not so memory consuming, but it still consumes CPU and network traffic), and the following regular file transport is even more resource consuming.
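A minimal in-memory sketch of this idea in Python 3 follows. It is not the fs_wmlog.WmLog class of mirrord; the method names append() and read_from() are assumptions chosen for the example. The log itself holds no per-client state: each reader keeps its own integer pointer and passes it in.

```python
import threading


class WmLog:
    """In-memory modification log keyed by an increasing serial number."""

    def __init__(self):
        self._records = {}
        self._serial = 0
        self._lock = threading.Lock()  # shared by main and server threads

    def append(self, action, path):
        """Called by the Monitor; returns the serial number of the new record."""
        with self._lock:
            self._serial += 1
            self._records[self._serial] = (action, path)
            return self._serial

    def read_from(self, pointer):
        """Return (new_pointer, records) for every serial after `pointer`.

        Raises KeyError if some of those records have already been trimmed,
        in which case the client would need a whole new sync init.
        """
        with self._lock:
            last = self._serial
            records = [
                (n, self._records[n]) for n in range(pointer + 1, last + 1)
            ]
        return last, records


log = WmLog()
log.append("CREATE", "/var/www/html")
log.append("CREATE", "/var/www/html/include")

pointer_a, batch_a = log.read_from(0)       # client A catches up to serial 2
log.append("FWRITE", "/var/www/html/index.php")
pointer_b, batch_b = log.read_from(0)       # client B starts later, from 0
pointer_a, batch_a2 = log.read_from(pointer_a)  # client A resumes at 2

print(pointer_a, batch_a2)
```

A client that was interrupted only needs to remember its last pointer to resume, which is the "restart from the broken point" behaviour described above.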

But unlike MySQL replication, so far there is no implementation to restart mirrord from the broken point. Because the file system may be changed without monitoring between 2 boots of mirrord, the simple way is to rebuild the whole snapshot from the beginning and reset wmLog to 0. This also forces the client to redo sync init, because it's necessary to guarantee the consistency of the original file system and the mirrored file system; mirrord achieves this by requiring the client to give the server thread an MD5 session generated at the time it started.

A future version will improve this boot strategy by comparing the "current" file system at boot with the snapshot from the last termination, to compute the modifications between the 2 boots; this can make the boot init smoother.

wmLog is a hash table, so Python's dictionary is a fairly direct choice, but it lives in memory, while the log grows larger and larger. If we assume the average length of 1 record is 30 bytes, and the rate of modification is 1/s, then it will consume "30 * 86400 / 1024 / 1024 = 2.47 MBytes" of memory per day, and since you can control the length of the wmLog, you can control the upper memory limit. If the original file system is not so large, or it is not modified so frequently, it is a good idea to keep the records in this in-memory wmLog.
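The estimate can be replayed for other retention periods with a few lines of Python 3. The 30 bytes and 1 event per second are the assumptions stated above; the function name is made up for this example. Note that the figure counts raw record bytes only, so the real footprint of a Python dict holding these records will be higher.

```python
SECONDS_PER_DAY = 86400


def wmlog_mbytes(days, record_bytes=30, events_per_second=1):
    """Raw size in MBytes of the records logged during `days` days."""
    total_bytes = record_bytes * events_per_second * SECONDS_PER_DAY * days
    return total_bytes / 1024 / 1024


per_day = wmlog_mbytes(1)      # about 2.47 MBytes
per_month = wmlog_mbytes(30)   # about 74.2 MBytes

print("%.2f MBytes per day, %.1f MBytes per 30 days" % (per_day, per_month))
```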

Otherwise, it is sensible to use a hash-like database, such as Berkeley DB. Only a simple class fs_wmlog.WmLog is created to wrap the operations of Berkeley DB, because wmLog only accepts integer keys while BDB only accepts strings; the wrapper keeps this consistent.

At first I used the in-memory dict, but the production line I must deal with contains very large file systems with millions of dirs/files. For simplicity, this version turns totally to BDB, but I will implement both of them in the future.

My first implementation combined both of them behind one API, I mean there were both a dict and a Berkeley DB in wmLog then, like this:

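  import bsddb

  class WmLog:
  	def __init__(self, path):
  		# most recent records, kept in memory
  		self.memlog = {}
  		# older records, in a Berkeley DB BTree file ('n' always creates a new, empty db)
  		self.dbdlog = bsddb.btopen(path, 'n')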

The most recent records are in memlog, and the earliest memlog records are put into Berkeley DB when memlog grows too large. Then I found it's not necessary to do this, because it's more convenient to rely on BDB's own memory management.

So far only the actions with the related pathnames are logged and transmitted, but the files' status is also useful; furthermore, with a hostid, it is possible to make 2 or more hosts mirror each other. The lack of them is rooted in the immaturity of this first design, and will be improved in the future.

The Fourth Story: File System Snapshot

A file system snapshot is a data structure that stores all the path names with their related status, such as file type (directory or regular file), permission and ownership, size and last modified time. In this version, only the pathnames and file types are contained in the snapshot; a future version will improve it.

There are several reasons why a file system snapshot is necessary:

  1. The first purpose is to achieve faster access to a file's name and its status, especially in the stage of the client sync init.

    From the example described in Overview, we know that the performance of disk access depends on the size and number of the files, and a large file system with many small files can not be accessed quickly, especially when walking it. So a data structure is necessary to accelerate the procedures, either in memory or in a disk (database) file. In fact it's possible to implement several mechanisms for this file system snapshot, which will all be shown later.

  2. The second purpose is to lock the snapshot while the client is doing sync init. During the procedure of client sync init, it's nearly certain that the original file system will be modified, while the client only knows the serial number at which the sync init starts (although it's possible to transmit the serial number when the sync init ends, it will make the problem more complex). This leads to inconsistency between the 2 file systems, because the modifications of the original file system during sync init would be performed before the next coming wmLog records.

    By locking the snapshot and the read pointer for a specified server thread while the client does sync init, it's always certain that the mirrored file system is "consistent" with the original file system. Here "consistent" only means that the mirrored file system never contains anything more than the wmLog has told fs_mirror, while the delay between the 2 hosts remains and can not be eliminated because of the basic principles of this mirroring.

    Another effect of locking is to make sure there is always at most one client doing the init sync, since it's a resource consuming procedure.

  3. The problem of inotify's IN_MOVED_FROM events has been described above: only the top dir/file names can be detected when a move occurs. Since the dirs/files have been "removed", there is no way to know what subdirs/subfiles those top names contained in order to remove them from the inotify watches correspondingly, except with the aid of the file system snapshot.

    With this "retentivity of vision" (the snapshot still remembers what was there before the move), mirrord can get exactly all the subdirs/subfiles and do the correct operations.

  4. In the boot init stage, mirrord includes/excludes some directories by fs_info.FsInfo, but when the inotify Monitor begins running, mirrord loses this ability. This is especially evident at the junction of include and exclude. For example, "/var/www/html/tin" is included, and "/var/www/html/tin/xex" is excluded; if "tin/xex" is "DELETE"d first and "CREATE"d subsequently, "tin/xex" and its subdirs will be added to the inotify watches, which is not what you expect.

    With a snapshot, by contrast, it is possible to implement a method to make the judgement.

    Maybe it's better for mirrord itself to make the judgement, by holding the FsInfo instances and giving FsInfo a corresponding method such as __contains__.

There are several ways to implement the file system snapshot. The simplest is an in-memory hash table (dict): the keys are the pathnames and the values are the related status of the files (so far only 0/1 to represent the type of the file). But it consumes too much memory, and it can't get the names of the subdirs/subfiles easily, except by traversing all the keys and comparing them (to the top pathname that raised the IN_MOVED_FROM inotify event). Let's call this implementation "PureDict".

Another implementation is to create a tree-like data structure to store the pathnames. I have defined a basic Tree class, which can be found in the caxes subproject of ulfs, but here I implemented a different one, which I may add to the caxes project.

With this data structure, the internal storage of a record is like this:

  >>> fssnap['var']['www']['html'] = 1
  >>> fssnap['var']['www']['html']['index.php'] = 0

but it should still have the hash-like APIs, to make the operations more convenient and intuitive, and to be consistent with other implementations of the file system snapshot, so at least these operations should be valid:

  >>> fssnap['/var/www/html'] = 1
  >>> fssnap['/var/www/html/index.php'] = 0
  >>> x = fssnap.pop('/var/www/html')
  >>> for fname in fssnap: print fname
  >>> for fname in fssnap.keys(): print fname
  # The generator should be used more widely than the keys() method,
  #	because keys() can consume too much memory.

This implementation can save some memory, and the "retentivity of vision" functionality can be achieved easily with this data structure. Let's name it "SnapTree".
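The following Python 3 sketch shows one way such a tree could offer the path-keyed operations listed above. It is only an illustration under my own assumptions, not the SnapTree from fs_snap.py: the node layout (a dict with "value" and "children") and the behaviour of pop() (returning the whole subtree as a flat dict) are choices made for this example.

```python
class SnapTree:
    """Path-keyed snapshot stored as nested nodes.

    Values follow the text: 1 = directory, 0 = regular file.
    """

    def __init__(self):
        self._root = {"value": None, "children": {}}

    @staticmethod
    def _parts(path):
        return [p for p in path.split("/") if p]

    def _find(self, path):
        node = self._root
        for part in self._parts(path):
            node = node["children"].get(part)
            if node is None:
                return None
        return node

    def __setitem__(self, path, value):
        node = self._root
        for part in self._parts(path):
            node = node["children"].setdefault(
                part, {"value": None, "children": {}}
            )
        node["value"] = value

    def __getitem__(self, path):
        node = self._find(path)
        if node is None or node["value"] is None:
            raise KeyError(path)
        return node["value"]

    def _walk(self, node, prefix):
        # Intermediate nodes that were never assigned a value are skipped.
        if node["value"] is not None:
            yield prefix, node["value"]
        for name, child in sorted(node["children"].items()):
            yield from self._walk(child, prefix + "/" + name)

    def __iter__(self):
        for path, _value in self._walk(self._root, ""):
            yield path

    def pop(self, path):
        """Detach `path`; return {pathname: value} for it and all below it."""
        parts = self._parts(path)
        parent = self._find("/".join(parts[:-1]))
        if not parts or parent is None or parts[-1] not in parent["children"]:
            raise KeyError(path)
        node = parent["children"].pop(parts[-1])
        return dict(self._walk(node, "/" + "/".join(parts)))


fssnap = SnapTree()
fssnap["/var/www/html"] = 1
fssnap["/var/www/html/index.php"] = 0
fssnap["/var/www/html/include"] = 1
fssnap["/var/named"] = 1

moved = fssnap.pop("/var/www/html")   # what an IN_MOVED_FROM would take away
print(moved)
print(list(fssnap))
```

Because a whole subtree hangs off one node, removing a moved directory and learning what it contained is a single detach plus a walk of that subtree only, instead of a scan over every key as in PureDict.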

The third way is to build a real file system that only contains the dirs and empty regular files which have the same pathnames as the original file system. This snapshot can be on an Ext3fs, or on Reiserfs for better performance (since the regular files on the snapshot file system are empty, I guess this can perform better), or even on Tmpfs. Let's name it "Snapfs".

All these three implementations (and the rest of the implementations that have not been described so far) can be found in the module fs_snap.py, and here is a comparison of the boot init stage using each of them separately:

  All files number:   2939516
  Dirs number:        104609
  Empty dirs number:  51671
  Record file size:   206MB (find /absolute/path/to/the/part >/tmp/record.txt)

                  PureDict       Snapfs       SnapTree1      SnapTree2
  Mem consumed    388MB(19.2%)   417MB*       261MB(12.9%)   254MB(12.6%)
  Boot time       9min           20min        ~10min         10min
  CPU             20%            25%          15%            15%
  Load average    1(4 CPUs)      1.5          1              1
  Disk I/O        R:2MB/s        R:1.5MB/s    R:1.3MB/s      R:<1.3MB/s
                  W:2MB/s        W:>2MB/s     W:<0.3MB/s     W:<1.3MB/s

* Snapfs basically consumes no memory, only disk space. This result only reflects the situation on Ext3fs; the situation on Reiserfs has not been tested. Maybe it's also possible to make use of RAID (RAID0) to improve the performance.

Tmpfs seems a good idea, but it's impractical if you look into it deeply, because every inode has to occupy a single page (4096 bytes). Although it is possible to limit the inode size, it's not convenient and can not solve the problem radically. Maybe it's possible to put most of the snapshot on swap, but I have not found an effective resource limit method (setrlimit in Python, or limits.conf of PAM, do not act as I expect).

You can use the command df -i to check the number of inodes that have been consumed.

If there is an effective resource limit method, I think it can also be applied to PureDict/SnapTree.

Here is the comparison for another file system:

  All files number:   5951114, record file size: 416MB
  Dirs number:        227534,  record file size: 12MB
  Empty dirs number:  115314,  record file size: 6.3MB

                  PureDict       Snapfs       SnapTree1      SnapTree2
  Mem consumed    781MB(38.6%)   -*           530MB(26.2%)   514MB(25.4%)
  Boot time       19min          -            ~23min         ~23min
  CPU time        6:27.77        -            10:32.68       11:01.55
  CPU             18%            -            14%            14%
  Load average    1.3(4 CPUs)    -            1              1
  Disk I/O        R:1.5MB/s      -            R:1.3MB/s      R:1.3MB/s
                  W:0.5MB/s      -            W:0.4MB/s      W:<0.4MB/s

To avoid the influence of other aspects of mirrord, check the script mirrord_consumed.py in the "cutils/trunk/prototypes" subdir. It contains nothing except the snapshot implementations above, and you can check the memory it consumes for each mechanism separately with the top command.

Now there are two types of snapshot: w(atched)dirs and (regular)files. This also comes from the immature first design, and it causes the "get subdirs/subfiles names when IN_MOVED_FROM" function mentioned above not to act as expected. So until now only the first 2 requirements have been achieved; the third (subdirs when IN_MOVED_FROM) will be completed in the next version after merging the wdirs and files into just one snapshot structure, so the next version will contain this function:

  >>> x = fssnap.pop('/var/www/html')
  >>> print x
  {'/var/www/html' : 1, '/var/www/html/index.php' : 0}

The snapshot mechanisms described so far are all in-memory solutions. As for wmLog, Berkeley DB can also be applied to fit the needs of a large file system with millions of small files.

BDB of course is hash-like, then how can it find the subdirs/subfiles to solve the "IN_MOVED_FROM" problem? This can be resolved by BDB's BTree access method: because a BTree is ordered, when a directory raises an event, call BDB's set_location() with the prefix key path, and ask for next() repeatedly until key.startswith(prefix) returns False or DBNotFoundError is thrown. By overloading the pop() method of the hash-like BDB implementation, we can get the same effect as above.
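The ordered prefix scan can be tried out without Berkeley DB. In the Python 3 sketch below, a sorted list plus the bisect module stands in for the BTree (bisect_left plays the role of set_location(), stepping through the list plays the role of next()); the class name SortedSnapshot is made up for this example. One detail worth noticing: in sorted order "/var/www/html.old" sits between "/var/www/html" and "/var/www/html/include", so the sketch scans for the prefix plus "/" rather than the bare prefix, otherwise the sibling would be popped too.

```python
import bisect


class SortedSnapshot:
    """Stand-in for a BTree-ordered store: keys are kept in a sorted list."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def keys(self):
        return list(self._keys)

    def pop(self, path):
        """Remove `path` and everything under it; return them as a dict."""
        removed = {path: self._values.pop(path)}  # KeyError if absent
        del self._keys[bisect.bisect_left(self._keys, path)]

        child_prefix = path.rstrip("/") + "/"
        # "Seek" to the first key >= child_prefix, then step forward
        # while the keys still belong to the subtree.
        start = bisect.bisect_left(self._keys, child_prefix)
        end = start
        while end < len(self._keys) and self._keys[end].startswith(child_prefix):
            key = self._keys[end]
            removed[key] = self._values.pop(key)
            end += 1
        del self._keys[start:end]
        return removed


snap = SortedSnapshot()
snap["/var/www/html"] = 1
snap["/var/www/html.old"] = 1
snap["/var/www/html/include"] = 1
snap["/var/www/html/index.php"] = 0

popped = snap.pop("/var/www/html")
print(popped)
print(snap.keys())
```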

Although BDB puts the data file on disk, it has its own memory management and the performance is acceptable; a further improvement of the performance I can imagine is to put BDB's data file on a RAID0 array.

There is another requirement: smooth boot. As is known, it's resource consuming to set up the snapshot and do the client's first sync. If the client (fs_mirror) is terminated unexpectedly, it can be restarted from the broken point with the aid of wmLog, by submitting the serial number of the broken point and the corresponding session. But if the server (mirrord) is terminated unexpectedly, it has to rebuild the whole file system snapshot when it reboots, and the client has to redo the whole first sync, because the server has no idea about what has been modified between the two boots.

To solve this problem, it is necessary to dig into the potential of the snapshot and wmLog. Since we have the old snapshot at the broken point, it's possible to compare it with the current file system and compute the differences, and turn these differences into normal records in the wmLog; then the server can keep using the old session and serial, and the client can skip the whole resync procedure!
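As a rough illustration of "compute the differences and turn them into normal records", here is a Python 3 sketch that compares two snapshots given as {pathname: 1 for directory, 0 for regular file} dicts. The function name and the ordering rules are my assumptions, not a design taken from mirrord. Since the current snapshot stores no size or modification time, a regular file that changed its content but kept its name can not be detected this way; that would need the richer status info planned for a future version.

```python
def diff_snapshots(old, new):
    """Return wmLog-style records that turn snapshot `old` into `new`."""
    records = []

    # Paths that disappeared or changed type. A parent is a string prefix
    # of its children, so reverse sorting emits children before parents.
    gone = [p for p in old if p not in new or old[p] != new[p]]
    for path in sorted(gone, reverse=True):
        records.append(("DELETE", path))

    # Paths that appeared or changed type, parents before children.
    fresh = [p for p in new if p not in old or old[p] != new[p]]
    for path in sorted(fresh):
        action = "CREATE" if new[path] == 1 else "FWRITE"
        records.append((action, path))

    return records


old_snapshot = {
    "/var/www/html": 1,
    "/var/www/html/index.php": 0,
    "/var/www/html/tmp": 1,
    "/var/www/html/tmp/a.txt": 0,
}
current_fs = {
    "/var/www/html": 1,
    "/var/www/html/index.php": 0,
    "/var/www/html/include": 1,
    "/var/www/html/include/config.php": 0,
}

boot_records = diff_snapshots(old_snapshot, current_fs)
for record in boot_records:
    print(record)
```

These records could then be appended to the existing wmLog with the next serial numbers, so a client that resumes from its old pointer receives them like any other modification.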

The Fifth Story: Monitor

wmLog and snapshot are operated correspondingly when the inotify events occurs. Then when and how these operations being invoked is done by inotify Monitor, as memtioned in the second story.

Besides the operations described in the second story, there are several other functionailities the "Monitor" should contains:

Manage the watches via the API afforded by pyinotify.WatchManager, as discussed above, the wd(watch descriptor) will not removed from watches when a dir/file is moved, so Monitor has to do this cleaning up, and update the file system snapshot correspoindingly.

On the other hand, the recursive problem makes inotify only raise "IN_CREATE" event for the top directory when you mkdir/cp/mv recursively, so once Monitor catch an event on directory, it should run a walk in that directory, find all its subdirs recursively to watches. Although there is a traversing, in common condition, it's necessary to only walk a very small part of the file system, thus it is not resource consuming.

There are "missed" and "vanished" problems when Monitor performs this "walk adding watches" action. Let me explain them now.

As we know, mirrord/fs_mirror is not an absolutely realtime solution. It processes the inotify events in periodic loops, and the modifications that occur during one loop can only be processed in the next loop. The missed problem then arises like this:

  This graph should be displayed with monospaced fonts:
  
                                    |
                modifications       |        operations
                         ------ next loop ------
                                    |
                                    o<---(listdir(root))
                                    |
            (create root/subdir)--->x
       (create root/subdir/dnew)--->x
       (fwrite root/subdir/fnew)--->x
                                    |
                                    o<---(add root to watch)
                                    |
                                    o<---(listdir(root/subdir1))
                                    o<---(add root/subdir1 to watch)
                                    |
                                    o<---(listdir(root/subdir2))
                                    o<---(add root/subdir2 to watch)
                                    |
                                   ...
                                    |
                         ------ next loop ------
                                    | 

As shown above, in the gap between "listdir(root)" and "add root to watch", a new dir "root/subdir" is created. Since "listdir" has already been done, the application will not walk into this subdir and add it to the watches; and since the root has not yet been added to the watches, the modification in it will not raise inotify events. Thus you lose "root/subdir" and its subdirs totally -- none of them have been added to the watches, and any further modifications in them will not be reported to mirrord! They are "missed"!

The solution currently applied is to do the listdir() again just after "add root to watch"; the second listing finds what has been missed and walks into it.

But just a moment ago (on 24 September 2007), I thought: why not "add root to watch" before "listdir(root)"? This leads to overlapping additions, that is, the newly created "root/subdir" will be added to the watches twice, once by "walk adding watches" and once by the "inotify process", but this is not a serious problem and can be safely ignored.

But that means I must implement my own walk function rather than use os.walk(), since os.walk() always does listdir() first.

I will try to achieve this in a future version since it is a design hole, but before that, I must make sure that the iterator/generator is invoked sequentially rather than in parallel, for example:

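  import os

  def walk(root):
  	if os.path.isdir(root):
  		yield root, 1
  		for name in os.listdir(root):
  			# Re-yield the items of the recursive call; a bare walk(...)
  			# would only create a generator object and discard it
  			for item in walk(os.path.join(root, name)):
  				yield item
  	else:
  		yield root, 0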

Will this "walk" pause after "yield root, 1" and wait for "root" to be processed (sequential)? Or will it call "listdir(root)" immediately (parallel)? If the former, this method is usable; otherwise there are problems unless there is some related solution. It is necessary for me to write a prototype script, or maybe you already have the answer. If you know the solution, would you please give me some tips? Thank you very much :D
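
Python generators are lazy: the body runs only up to the next yield and resumes when the caller asks for the next item, so the behavior is the sequential one. A small prototype along the following lines can confirm it. The log list and the temporary directory are only scaffolding for the demonstration, and appending "watch ..." stands in for really adding a watch.

```python
import os
import tempfile

def walk(root, log):
    """Yield (path, is_dir) pairs and log the moment listdir() really runs."""
    if os.path.isdir(root):
        yield root, 1
        # Execution resumes here only when the caller asks for the next item.
        log.append("listdir " + root)
        for name in sorted(os.listdir(root)):
            for item in walk(os.path.join(root, name), log):
                yield item
    else:
        yield root, 0

log = []
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, "subdir"))
for path, is_dir in walk(top, log):
    if is_dir:
        log.append("watch " + path)  # stands in for "add path to watches"

for line in log:
    print(line)
```

For every directory, the "watch" line is logged before the matching "listdir" line, which is the order needed to close the "missed" gap.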

The "vanished" problem plays out this way:

  This graph should be displayed with monospaced fonts:
  
                                    |
                modifications       |        operations
                         ------ next loop ------
                                    |
                    (delete dir)--->x
                    (create dir)--->x
                    (delete dir)--->x
                                    |
                         ------ next loop ------
                                    |
                                    o<---(process IN_DELETE 1)
                                    o<---(process IN_CREATE 2*)
                                    |
                                    o<---(process IN_DELETE 3*)
                                    |
                                   ...

The problem occurs at "process IN_CREATE 2*". When this operation is invoked, it needs to call os.walk() and os.listdir(), both of which require a directory that actually exists, but the directory has been deleted (vanished) by the previous modifications, which causes an OSError to be thrown.

"process IN_DELETE 3*" may have problems if the errors in "process IN_CREATE 2*" are simply ignored, because it has to delete nonexistent items from the snapshot and watches. Ignoring the errors again is not exact and leaves holes in the system!

This also applies to "MOVED" events, since "MOVED" has the recursion problem too and thus calls "walk adding watches" as well. For example:

  This graph should be displayed with monospaced fonts:
  
                                    |
                modifications       |        operations
                         ------ next loop ------
                                    |
     (MOVED_TO from unmonitored)--->x
                                    |
     (MOVED_FROM to unmonitored)--->x
                                    |
                         ------ next loop ------
                                    |
                                    o<---(process IN_MOVED_TO *)
                                    |
                                    o<---(process IN_MOVED_FROM *)
                                    |
                                   ...

Since the dir/file has actually been removed (vanished) before "process IN_MOVED_TO", it is not added to the snapshot and watches, and the next "process IN_MOVED_FROM" gets stuck by raising KeyError.

The solution to the "vanished" problem can be found in the source code and the related unit tests.

In contrast to the "vanished" problem, there is an "explode" problem for clients. For example:

  This graph should be displayed with monospaced fonts:
  
                                    |
                modifications       |        operations
                         ------ next loop ------
                                    |
          (FWRITE /path/to/file)--->x
                                    |
          (DELETE /path/to/file)--->x
                                    |
      (CREATE DIR /path/to/file)--->x
                                    |
                         ------ next loop ------
                                    |
                                    o<---(process IN_FWRITE_CLOSE *)
                                    |
                                    o<---(process IN_DELETE)
                                    |
                                   ...

The first operation, "process IN_FWRITE_CLOSE", meets the problem, because it tells the remote client to sync a regular file, but the target is actually a directory! At the same time, the client also has the "vanished" problem. The solutions to these client side problems will be discussed later.

A very useful requirement is the ability to add/delete watches dynamically. In the previous sections, I described that in the boot init stage and the "walk adding watches" procedure it is necessary to use fs_info to include/exclude some dirs/files. But sometimes you may find dirs/files that are too big or vary too frequently (that is, the value of size*freq/min is too big). It would then be useful to remove those dirs or files from fs_info.FsInfo and exclude them by invoking a command, such as sending a signal:

  sh# mirrord -k exclude "/var/www/html/site.com/log/20070924.log"

Or, in contrast, you may want to add several dirs/files to the watches:

  sh# mirrord -k include "/var/www/html/include/new/"

So far this function has not been implemented.

The reason why IN_CREATE is only handled for directories is that for regular files IN_CLOSE_WRITE is more meaningful, and IN_CREATE for files becomes redundant.

IN_CLOSE_WRITE indicates that the content of a regular file has been modified. As you can see in all the descriptions, only then does the client fs_mirror ask for an actual file transport, while the other types of modification can be performed on the client itself, consuming no server resources.

This file transport can be carried out via any existing file transport protocol, such as FTP, NFS, SMB/CIFS, etc.

A very important characteristic is that MOVE can be executed locally rather than being treated as CREATE/FWRITE after DELETE; the latter often requires an actual transmission of file content, which costs more server resources.

But there is no guarantee that a "MOVE" will only be treated as "MOVE"; it may still be treated as "DELETE" followed by "CREATE/FWRITE", because IN_MOVED_FROM and IN_MOVED_TO are two separate events with only a cookie to relate them. Once an IN_MOVED_TO event is detected, the Monitor looks for an IN_MOVED_FROM with the same cookie; if it finds one, the pair becomes a "MOVE" action and is enqueued to wmLog. If there is no such pair, IN_MOVED_FROM is treated as "DELETE" and IN_MOVED_TO is treated as "CREATE/FWRITE". Monitor decides that there is no pair after several loops (currently 2). That means if IN_MOVED_TO arrives 3 inotify process loops after IN_MOVED_FROM, it is treated as "CREATE/FWRITE", because the corresponding IN_MOVED_FROM event has already been cleaned up and treated as "DELETE": Monitor can't wait for an IN_MOVED_FROM's IN_MOVED_TO partner forever.
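
The pairing rule can be sketched like this. It is a minimal illustration under assumptions, not the Monitor's real code: the class name MovePairer, its method names and the tuple record format are hypothetical; only the cookie matching and the limit of 2 loops come from the description above.

```python
class MovePairer(object):
    """Pair IN_MOVED_FROM / IN_MOVED_TO events by cookie (hypothetical helper)."""

    def __init__(self, max_loops=2):
        self.max_loops = max_loops
        self.pending = {}  # cookie -> [source path, loops waited so far]

    def moved_from(self, cookie, path):
        # Remember the source and wait for a partner with the same cookie.
        self.pending[cookie] = [path, 0]

    def moved_to(self, cookie, path, is_dir):
        if cookie in self.pending:
            src = self.pending.pop(cookie)[0]
            return [("MOVE", src, path)]
        # No partner (moved in from an unmonitored place, or partner expired).
        return [("CREATE" if is_dir else "FWRITE", path)]

    def end_of_loop(self):
        """Age the pending sources; expired ones become DELETE records."""
        records = []
        for cookie in list(self.pending):
            entry = self.pending[cookie]
            entry[1] += 1
            if entry[1] > self.max_loops:
                records.append(("DELETE", entry[0]))
                del self.pending[cookie]
        return records
```

With max_loops=2, an IN_MOVED_TO that shows up in the same loop or in one of the next two loops is still paired; one that shows up three loops later finds nothing pending and is reported as CREATE or FWRITE.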

The Sixth Story: Schedule and Protocol

The inotify Monitor runs an infinite loop dedicated to inotify-related work, which means another infinite loop must run to listen on a socket port and handle synchronization requests from clients, unless select/poll is used. (So which is the better one? For now I use a schedule running in a child thread.)

This is the responsibility of the schedule. It just waits for a few commands: START (a new server thread for the client), or TERMINATE (by and only by the main thread) ...

So far, the schedule just runs as a thread, and only spawns child server threads when a client asks for synchronization. But as far as I know, because of the Python GIL, Python threads cannot make full use of multiple CPUs, so what about schedule processes?

Once the server thread has started, the client communicates with it via a TCP socket, so it is necessary to design a rational application protocol.

The current protocol for the client that asks for an init sync:

  C: START
  S: OK
  C: INIT
  S: SESSION:8c0a4927a5280905e6f3ca01d5a02a53
  S: SERIAL:3279
  S: /var/www/html/
  S: /var/www/html/include/
  S: /var/www/html/syssite/
  ... # Only dirs
  S: EOF
  S: /var/www/html/index.html
  S: /var/www/html/index.php
  S: /var/www/html/include/config.php
  ... # Only regular files
  S: EOF
  C: NEXT
  # Server thread blocked until there is any modification ...
  S: SN:3282
  S: CREATE:/var/www/html/new
  S: DELETE:/var/www/html/tmp/s2124.html
  S: FWRITE:/var/www/html/files/s5237.html
  S: EOF
  C: NEXT
  # Server blocked ...
  ...

If the client gives an invalid command, the server thread should report that and terminate:

  C: THIS
  S: INVALID COMMAND
  
  C: START
  S: OK
  C: THIS
  S: INVALID INIT COMMAND
  
  C: START
  S: OK
  C: INIT
  S: ...
  ...
  C: NXT
  # Should be "NEXT"
  S: INVALID REALTIME COMMAND

I think this protocol has some problems, and for the future smooth boot and single file system snapshot features, the following style seems better:

  C: START\r\n
  S: OK\r\n
  C: INIT\r\n
  S: SESSION:8c0a4927a5280905e6f3ca01d5a02a53\r\n
  # Assume the upper limit is 3279 as the example above
  S: SERIAL:1024\r\n
  S: CREATE:/var/www/html/\r\n
  S: CREATE:/var/www/html/include/\r\n
  S: FWRITE:/var/www/html/include/config.php\r\n
  S: FWRITE:/var/www/html/index.html\r\n
  S: FWRITE:/var/www/html/index.php\r\n
  S: CREATE:/var/www/html/syssite/\r\n
  ...
  S: EOF\r\n
  C: NEXT\r\n
  S: SERIAL:2048\r\n
  S: ...
  ...
  S: EOF\r\n
  C: NEXT\r\n
  S: SERIAL:3072\r\n
  ...
  C: NEXT\r\n
  S: SERIAL:3279\r\n
  ...
  C: NEXT\r\n
  # Server thread blocked until there is any modification ...
  # And client waits for the response ...
  S: SERIAL:3282\r\n
  S: CREATE:/var/www/html/new\r\n
  S: DELETE:/var/www/html/tmp/s2124.html\r\n
  S: FWRITE:/var/www/html/files/s5237.html\r\n
  S: EOF\r\n
  C: NEXT\r\n
  # Server thread blocked and client waits ...

Here, the protocol becomes line based, which makes it possible to communicate with the server via telnet, and the first init sync behaves like a normal wmLog transmission. Thus if mirrord restarts smoothly from the broken point and has computed the modifications between the 2 boots, it only transmits those modifications as normal wmLog records as above (but containing only CREATE/FWRITE/DELETE actions, since detecting MOVED actions is not so easy).
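
On the client side, one batch of this line based style could be parsed as sketched below. The function name parse_batch is hypothetical, and only the SERIAL, CREATE, FWRITE, DELETE and EOF lines shown in the example are handled; the wire format of MOVE records is not shown above, so it is left out.

```python
def parse_batch(lines):
    """Parse 'SERIAL:<n>' and 'ACTION:<path>' lines up to the closing 'EOF'."""
    serial = None
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "EOF":
            return serial, records
        # Split on the first colon only, so paths may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("malformed line: %r" % line)
        if key == "SERIAL":
            serial = int(value)
        elif key in ("CREATE", "FWRITE", "DELETE"):
            records.append((key, value))
        else:
            raise ValueError("unknown record: %r" % line)
    raise ValueError("batch ended without EOF")
```

The lines argument can be any iterable of text lines, for example the file object returned by socket.makefile() on the connection to mirrord.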

I think it is necessary to limit the number of records transmitted each time, to avoid overloading the server side, so here at most 1024 records are transmitted each time.
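
The chunking can be illustrated with a small generator. This sketch assumes that records are numbered consecutively and that the SERIAL line of a batch carries the serial of its last record, which matches the 1024 / 2048 / 3072 / 3279 sequence in the example; the function name batches is hypothetical.

```python
def batches(records, first_serial, limit=1024):
    """Yield (serial of last record, chunk) pairs with at most `limit` records each."""
    for start in range(0, len(records), limit):
        chunk = records[start:start + limit]
        yield first_serial + start + len(chunk) - 1, chunk
```

For 3279 pending records starting at serial 1, this produces four batches, and the server would send one batch per NEXT command from the client.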

The following descriptions use this line based style for consistency, but the current version actually uses the previous non line based style; this will be changed in the next version.

If the client is terminated unexpectedly and wants to restart from the broken point, it gives the server the session and serial number, which are saved by the client itself. The protocol looks like this:

  C: START\r\n
  S: OK\r\n
  C: SN:8c0a4927a5280905e6f3ca01d5a02a53, 3282\r\n
  S: OK\r\n
  C: NEXT\r\n
  # Server thread blocked and client waits ...

It is more likely that the client only recorded the last transmitted serial number (3072), so it becomes this:

  C: START\r\n
  S: OK\r\n
  C: SN:8c0a4927a5280905e6f3ca01d5a02a53, 3072\r\n
  S: SERIAL:3282\r\n
  S: FWRITE:/var/www/html/program.php\r\n
  S: CREATE:/var/www/html/include/ext/\r\n
  S: DELETE:/var/www/html/tmp/s3984.html\r\n
  ...
  S: EOF\r\n
  C: NEXT\r\n
  # Server thread blocked and client waits ...

If the client gives an invalid session or serial number, the server requires the client to do the first init sync; the client then either sends the "INIT" command or terminates without synchronization.

  C: START\r\n
  S: OK\r\n
  C: SN:60a50c8be3147f15fe57db3fb0216599, 3072\r\n
  S: SESSION INVALID\r\n
  C: INIT\r\n
  ...
  
  C: START\r\n
  S: OK\r\n
  C: SN:8c0a4927a5280905e6f3ca01d5a02a53, -1\r\n
  S: SERIAL OVERDUE OR INVALID\r\n
  C: INIT\r\n
  ...
  
  C: START\r\n
  S: OK\r\n
  C: SN:8c0a4927a5280905e6f3ca01d5a02a53, -1\r\n
  S: SERIAL OVERDUE OR INVALID\r\n
  C: NEXT\r\n
  S: INVALID INIT COMMAND\r\n
  # Terminate the server thread
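
The server side decision for such a resume request might look like the sketch below. It is an illustration under assumptions: the function name check_resume is hypothetical, and so is the idea that a serial is "overdue" when it is older than the oldest record still kept in the wmLog. The reply strings are the ones used in the examples above.

```python
def check_resume(line, session, oldest_serial, newest_serial):
    """Return the reply for a 'SN:<session>, <serial>' resume request.

    INIT is assumed to be handled before this function is called, so
    anything that is not an SN line is rejected here.
    """
    line = line.rstrip("\r\n")
    if not line.startswith("SN:"):
        return "INVALID INIT COMMAND"
    client_session, sep, serial_text = line[3:].partition(",")
    if client_session.strip() != session:
        return "SESSION INVALID"
    try:
        serial = int(serial_text.strip())
    except ValueError:
        return "SERIAL OVERDUE OR INVALID"
    if serial < oldest_serial or serial > newest_serial:
        return "SERIAL OVERDUE OR INVALID"
    return "OK"  # accepted; the pending records (if any) can follow
```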

I described the server thread locking for the first init sync in the previous story. This locking makes changes to several variables safe and atomic, especially the variables shared between the main thread and the server threads, such as the servers Pool (a structure that stores the identities of the server thread instances), client_status, and the currently processed serial number.

So the shared memory is locked by smLock, which is an instance of threading.Lock(). Once the server thread gets smLock, it changes shared.client_status to inform the main thread, gets the last processed serial number (shared.serial), and plays a trick to lock the file system snapshot as described in the previous story.

Trick? As we know, the first init sync is resource consuming and also time consuming. If the snapshot were locked totally, the main thread would be blocked too long and too many unprocessed inotify events would accumulate in the inotify queue. This may overflow the queue, while making the queue long enough would consume too much memory. Or the main thread and other server threads may have to wait too long doing nothing, and when they come back, they have to deal with too many exceptions such as the "vanished" problem and the "MOVED" problem. That defeats the original purpose of spreading the load evenly over time and being as realtime as possible.

The trick relies on the concept of a pointer. At first, the pointer _snap points to shared.snap. After smLock is acquired, it is pointed at shared._temp_snap and the lock is released. After the first init sync for the client, smLock is acquired again, _snap is pointed back at shared.snap, shared.snap is updated from shared._temp_snap, and finally the lock is released. All the other parts of the application only ever use _snap. This is very easy to do in Python since Python "always" uses pointers. Please read the code for more details.
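
A minimal sketch of this pointer exchange follows. The names shared, snap, _temp_snap, _snap and smLock follow the text; the Shared class, the use of plain dicts as snapshots and the three functions are assumptions made for the illustration, and whichever thread performs the exchange would call begin_init_sync() and end_init_sync().

```python
import threading

class Shared(object):
    """Hypothetical stand-in for the shared state described in the text."""
    def __init__(self):
        self.snap = {}          # the real file system snapshot
        self._temp_snap = {}    # collects changes during a client's init sync
        self._snap = self.snap  # the only name the rest of the code uses

shared = Shared()
smLock = threading.Lock()

def begin_init_sync():
    with smLock:
        shared._temp_snap = {}
        shared._snap = shared._temp_snap  # new changes now land in the temp snapshot

def end_init_sync():
    with smLock:
        shared.snap.update(shared._temp_snap)  # fold the collected changes back in
        shared._snap = shared.snap

def record_change(path, info):
    with smLock:
        shared._snap[path] = info
```

While the init sync is running, shared.snap stays stable and can be read without holding smLock for a long time. A dict update only covers additions and changes; deletions made during the sync would need extra bookkeeping, which this sketch leaves out.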

This trick requires that only one client does the first init sync at a given time, so a client_init_Lock (threading.Lock()) is used to serialize the server threads. To avoid deadlock, client_init_Lock and smLock should be organized this way:

  class ServerThread:
  	def __init_send(self):
  		# Acquire before "try", so "finally" never releases a lock that was not acquired
  		client_init_Lock.acquire()
  		try:
  			smLock.acquire()
  			try:
  				shared.client_status = CLIENT_INITING
  				serial = shared.serial
  				shared.svPool[self] = serial
  			finally:
  				smLock.release()
  			... # Do first init sync
  			smLock.acquire()
  			try:
  				shared.client_status = CLIENT_INITTED
  			finally:
  				smLock.release()
  		finally:
  			client_init_Lock.release()

Here the pointer is not modified by the server thread; it is done by the main thread after it acquires smLock. The main thread knows when to exchange the pointer by checking shared.client_status.

Limitations of Pyinotify

So far I use the pyinotify module to catch and process inotify events. Maybe this is due to my shallow understanding, but I found some important limitations in using pyinotify, for example:

  1. The glue layer is too thick. More processing is needed before pyinotify Events can be raised and caught, which may consume more resources.

  2. The inotify watches are instances of a custom class defined in pyinotify and are stored in memory; no database is used to store them, and each watch contains too much information, so they consume too much memory.

    You can read and run the prototype script "prototypes/mirrord_add_watch.py" to check this problem. Here is an example comparison table:

    files num         dirs num        memory consumed
    2011967 (150M)    71333 (4.1M)    ~70MB
    5951114 (416M)    227534 (12M)    ~200MB

Maybe it's necessary to write another pyinotify implementation to fit these requirements. I will try to do this.

Usage Reference

Installation

The package that contains mirrord/fs_mirror is named cutils, which is a subproject of ulfs. uLFS means "Your own customized and easily manageable LFS (Linux From Scratch) distribution". It contains a set of tools to achieve this goal, such as User Based Package Management, Configuration Sharing Management, and this file system backup and near realtime mirror (maybe I will implement a truly realtime mirror solution in the future).

To install mirrord on the server whose file system is going to be mirrored, Python 2.4 or higher is required, and pyinotify-0.7.1 is used for this version:

  sh$ tar xfz Python-2.5.1.tar.gz
  sh$ cd Python-2.5.1
  sh$ ./configure --prefix=/usr
  sh$ make
  sh# make install
  
  sh$ tar xfj pyinotify-0.7.1.tar.bz2
  sh$ cd pyinotify-0.7.1
  sh$ python setup.py build
  sh# python setup.py install

On the client side, which runs fs_mirror, only Python is required; pyinotify is unnecessary.

Then install caxes-0.1.2, which is another subproject of ulfs. You can download it from the same download page as cutils. It contains some auxiliary data structures such as a Python Tree.

  sh$ tar xfz caxes-0.1.2.tar.gz
  sh$ cd caxes-0.1.2
  sh$ python setup.py build
  sh# python setup.py install

Finally, install cutils-0.1.1:

  sh$ tar xfz cutils-0.1.1.tar.gz
  sh$ cd cutils-0.1.1
  sh$ python setup.py build
  sh# python setup.py install --install-scripts=/usr/local/bin

Simple Usage

Look at the graph in The Second Story. To back up or mirror a file system, the first thing you have to do is tell the backup/mirror programs what to back up/mirror. You run fs_info to do so:

  server# fs_info -a t:/var/www/html system

This adds "/var/www/html" to the backup/mirror include list for the file system part with the identity name "system". This identity is created when you first run the command; you can also use any name you want, such as "www", "web", "work", etc.

Only the top dirs are necessary, so:

  server# fs_info -a t:/var/www/html \
  -a t:/var/www/html/include system

has the same net effect as the previous command ("/var/www/html").

To exclude some dirs/files from being backed up/mirrored, use the "x:" tag:

  server# fs_info -a t:/var/www/html \
  -a x:/var/www/html/cache system

This adds "/var/www/html/cache" to the exclude list.

If you already have lists, you can add them directly:

  server# fs_info -a tL:/tmp/included_01 \
  -a tL:/tmp/included_02 \
  -a xL:/tmp/excluded_01 system

What fs_info actually does is append the items to $datadir/$identity/.{t,x}_files, so you can also edit these files directly, or use shell redirection:

  server# find /var -maxdepth 1 >$datadir/$identity/.t_files

But fs_info can do several validation checks for you, for instance, checking the existence of the dirs/files. The choice is in your hands.

$datadir defaults to "/var/fs_info". You can change it to "/var/mirrord" by editing the configuration file "/etc/python/fs_info_config.py" to fit the needs of mirrord, whose own default is "/var/mirrord" and can be changed by editing the configuration file "/etc/python/mirrord_config.py".

These configuration files directly use the Python Tree data structure defined in the caxes package.

$identity defaults to "system", as mentioned above.

These configurations can also be changed by command line options, for example:

  server# fs_info -o datadir=/var/mirrord -a t:/var/www/html

Notice: the default values of the inotify kernel parameters are too small:

  16384 /proc/sys/fs/inotify/max_queued_events
  8192  /proc/sys/fs/inotify/max_user_watches

The value of "max_user_watches" depends on the total number of dirs in the file system part. You can use:

  sh# find $path -type d | wc -l

to count them, and make sure the value is sufficiently larger than that.

"max_queued_events" is the maximum length of the queue managed by inotify; the more frequently the file system changes, the bigger this value should be. If you see a message like "** Event Queue Overflow **", "max_queued_events" is too small, and monitoring after that point is inexact and problematic; you should restart "mirrord".

After these settings, it is time to start mirrord to monitor and record the modifications of the original file system:

  server# mirrord
  # OR verbose:
  server# mirrord -v

You can always adjust the behavior of mirrord by changing the configuration file "/etc/python/mirrord_config.py" or via command line options. Examples are the length of the wmLog (o.loglen: the longer it is, the more records are kept, which lets a client restart from the break point after a longer outage but may cost more resources and time when reading/updating the wmLog), the bind interface and port, the timeout of the socket used to communicate with clients, and the collection of identities of the file system parts that you want mirrored.

Wait for a while (depending on the hardware, the file system type you chose, and the number of dirs/files you want backed up/mirrored). When the "boot init finished" message is printed, you can switch to the client and start fs_mirror.

A future version may eliminate this waiting stage and let you start the fs_mirror client immediately after mirrord is invoked.

Before starting fs_mirror, it is necessary to make sure the client can communicate with mirrord and transport regular files correctly. Thus you must open the default port 2123 for mirrord (or choose another one), and the port for the protocol you choose for regular file transport. As described in the picture of The Second Story and in The Fifth Story: Monitor, the design of mirrord/fs_mirror is meant to support many file transport mechanisms, such as FTP, NFS, SMB/CIFS, etc. But so far there are only two mechanisms: REPORT and FTP.

"REPORT" just prints the actions represented in the wmLog received by fs_mirror. Although it does no actual dir creation or regular file transport, it is still a useful feature; I will talk about it later.

The "FTP" sync mechanism creates dirs and deletes or moves dirs/files locally, while only the writing of regular files leads to an actual file transport via the FTP protocol; this reduces resource consumption on the server side. Notice that the whole file is transported, not only the modified part, so it is NOT a byte level transport. It is therefore not suitable for files that are too big or change too frequently, that is, files for which the product of size and frequency is too large, or for which the value of "size / seconds" is too large, where seconds means the interval since the file was first transported. A future version of fs_mirror will include such a computation and determine whether to do the actual file synchronization.

These synchronization mechanisms are defined in the cutils.fs_sync module.
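
The size and frequency check planned for the FTP mechanism could be as simple as the sketch below. Nothing like it exists in fs_sync yet; the function name should_transfer, its parameters and the max_rate threshold are all hypothetical.

```python
def should_transfer(size, first_seen, now, transfers, max_rate):
    """Decide whether a changed file is still cheap enough to transport whole.

    size       -- current file size in bytes
    first_seen -- time (seconds) when the file was first transported
    now        -- current time (seconds)
    transfers  -- number of whole-file transports done so far
    max_rate   -- allowed average load in bytes per second
    """
    elapsed = max(now - first_seen, 1.0)  # avoid dividing by zero
    # Average bytes per second if this transport is carried out as well.
    rate = size * (transfers + 1) / float(elapsed)
    return rate <= max_rate
```

A small file rewritten every few seconds and a huge file rewritten once an hour can both exceed the threshold, which is the size times frequency idea from the paragraph on the FTP mechanism.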

It is a good idea to create a user named "mirror" or "backup" that has read permission on the file system parts you want mirrored; this can be achieved with Linux's ACL feature. Then allow this user to access the system via FTP; maybe you want to use FTP over SSL. So far no authentication mechanism has been implemented.

Edit the configuration file on the client to configure fs_mirror. Read the example "etc/fs_mirror_config.py" for more details.

Then simply invoke the fs_mirror client:

  client# fs_mirror
  # OR verbose:
  client# fs_mirror -v

"fs_mirror --help" gives you more information about the command line options.

As an example, you may want to change options like this:

  client# fs_mirror -v -o host=roc \
  -o fssync.user=root \
  -o fssync.port=2121 \
  --password-from=/root/.fs_mirror/secrets:roc \
  --datadir=/data/fs_mirror/roc 

Now the mirrored dirs/files will be put into /data/fs_mirror/$hostname, and fs_mirror will read the password from a secret file using the id $hostname. The options specified by "-o" (--option) have the same meaning as the o.* options in /etc/python/fs_mirror_config.py, described above. Here, the hostname and 2 file synchronization parameters have been changed; of course, make sure the hostname can be resolved.

If verbose is turned on, you may get messages like these on your screen:

  Daemon PID 7094
  Connecting to the Mirrord: ('roc', 2123) ...
  Conected.
  Server reply 'OK' for 'START'
  Server reply 'OK' for 'SN:b5e0ce480c925184dbdb6f23a62ddc6d,872302'
  Received next serial message: 'SERIAL:862303'
  WRITE FILE '/var/Counter/data/5597.dat', PREFIX '/data/fs_mirror/roc'
  Received next serial message: 'SERIAL:862306'
  CREATE DIR '/var/www/html/sample231.com/syssite', PREFIX 'data/fs_mirror/roc'
  WRITE FILE '/var/www/html/sample231.com/sysstie/rec.php', PREFIX '/data/fs_mirror/roc'
  DELETE DIR/FILE '/var/www/html/sample231.com/sysstie/rec.php.old', PREFIX '/data/fs_mirror/roc'
  ...

This procedure may take a long time (just as the first init of RAID1 may take long too), so you should avoid rebooting mirrord in this version. One of the future improvements is to make the reboot of mirrord smoother by computing the modifications made during the stoppage, or just ignoring them with a parameter, which would be less resource consuming.

There is a regular file "$datadir/sn" that just records the session and the serial number:

  sh# cat /data/fs_mirror/roc/sn
  b5e0ce480c925184dbdb6f23a62ddc6d,872302

It is used when fs_mirror tries to restart from the broken point. It also serves as a file lock (fcntl.LOCK_EX) to prevent you from mirroring two remote hosts' file systems into one directory, or mirroring a remote file system twice!
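
Such a lock can be taken as in the following sketch. It assumes fcntl.flock() is the call used with fcntl.LOCK_EX (the text only names the flag), adds LOCK_NB so that a second instance fails at once instead of waiting, and the function name lock_sn_file is hypothetical.

```python
import fcntl

def lock_sn_file(path):
    """Open the sn file and take an exclusive, non-blocking lock on it.

    Returns the open file object (keep it open to hold the lock), or
    None if another fs_mirror instance already holds the lock.
    """
    f = open(path, "a+")  # "a+" creates the file if needed and never truncates it
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        f.close()
        return None
    return f
```

The lock disappears automatically when the file is closed or the process exits, so a crashed fs_mirror does not leave a stale lock behind.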

As described in the Overview section, a rotate mechanism is necessary. So far fs_mirror has no builtin rotate implementation; that can only come in a future version, but you can write scripts to do it. There are two example scripts you can take as a reference in the "prototypes" subdir: prototypes/fs_mirror_rotate.py and prototypes/mirror_rotate. These 2 scripts look like this:

prototypes/mirror_rotate:

  #!/bin/sh
  
  hosts="p01 2121
  p02 2121
  p03 2121"
  
  cd `dirname $0`
  
  # for host in hosts; do
  echo "$hosts" | while read host port; do
      pid=`ps aux | grep "fs_mirror.*$host" | grep -v 'grep' | awk '{print $2}'`
      if [ $? -eq 0 ]; then
          if [ -n "$pid" ]; then kill $pid; fi \
          && ./fs_mirror_rotate.py $host \
          && /usr/local/bin/fs_mirror -v \
              -o host=$host \
              -o fssync.user=root \
              -o fssync.port=$port \
              --passwd-from=/root/.mirror/secrets:$host \
              --datadir=/data/fs_mirror/$host/
      fi \
      && cat /data/fs_mirror/$host/www/prima/usermap \
          | awk -F, '{printf("%s %s %s %s %s\n", $1, $2, $3, $4, $5)}' \
          | while read perm uid user gid site; do
              chown $uid.$gid /data/fs_mirror/$host/www/users/$site -R
              # gid always be ftpd
          done
  done

prototypes/fs_mirror_rotate.py:

  #!/usr/bin/python
  # -*- encoding: utf-8 -*
  
  import os,sys,shutil
  import time
  import datetime
  
  mirror = "/data/fs_mirror"
  backdir = "/data/hosts"
  try:
      host = sys.argv[1]
  except IndexError:
      print >> sys.stderr, "Lack of a host identity"
      sys.exit(1)
  
  day = datetime.date(*time.localtime()[:3])
  new = os.path.normpath("%s/%s" % (mirror, host))
  NL = len(new)
  old = os.path.normpath("%s/%s/%s" % (backdir, host, str(day)))
  OL = len(old)
  shutil.move(new, old)
  for root, dirs, files in os.walk(old):
      d_new = os.path.normpath("%s/%s" % (new, root[OL:]))
      os.mkdir(d_new)
      # status = os.lstat(root)
      # perm = status[0]
      # os.chmod(d_new, perm)
      # uid = status[4]
      # gid = status[5]
      # os.chown(d_new, uid, gid)
      try:
          os.chmod(d_new, 0755)
      except OSError:
          pass
      # print "CREATE DIR '%s'" % d_new
      for fname in files:
          f_new = os.path.normpath("%s/%s" % (d_new, fname))
          f_old = os.path.normpath("%s/%s" % (root, fname))
          os.link(f_old, f_new)
          # Hard link will reserve the permission and ownership of a file
          try:
              os.chmod(f_new, 0644)
          except OSError:
              pass
          # print "HARD LINK '%s' -> '%s'" % (f_old, f_new)
  
  interval = datetime.timedelta(days=14)
  overdue_day = day - interval
  overdue_dir = os.path.normpath("%s/%s/%s" % (backdir, host, str(overdue_day)))
  try:
      shutil.rmtree(overdue_dir)
  except OSError, (errno, strerr):
      if errno == 2:
          print strerr
      else:
          raise

Example rotated archives look like this:

  [root@stor p01]# ls -l
  total 144
  drwxr-xr-x+  6 root     root     4096 Oct  2 03:10 2007-10-03
  drwxr-xr-x+  6 root     root     4096 Oct  3 03:10 2007-10-04
  drwxr-xr-x+  6 root     root     4096 Oct  4 03:10 2007-10-05
  drwxr-xr-x+  6 root     root     4096 Oct  5 03:10 2007-10-06
  drwxr-xr-x+  6 root     root     4096 Oct  6 03:10 2007-10-07
  drwxr-xr-x+  6 root     root     4096 Oct  7 03:10 2007-10-08
  drwxr-xr-x+  6 root     root     4096 Oct  8 03:10 2007-10-09
  drwxr-xr-x+  6 root     root     4096 Oct  9 03:10 2007-10-10
  drwxr-xr-x+  6 root     root     4096 Oct 10 03:10 2007-10-11
  drwxr-xr-x+  6 root     root     4096 Oct 11 03:10 2007-10-12
  drwxr-xr-x+  6 root     root     4096 Oct 12 03:10 2007-10-13
  drwxr-xr-x+  6 root     root     4096 Oct 13 03:10 2007-10-14
  drwxr-xr-x+  6 root     root     4096 Oct 14 03:10 2007-10-15
  drwxr-xr-x+  6 root     root     4096 Oct 15 03:10 2007-10-16
  [root@stor p01]# ls 2007-10-16/etc/httpd/conf.d -l
  total 112
  -rw-r--r--  9 root root    58 Oct  8 13:41 bw_mod.conf
  -rw-r--r--  9 root root   187 Oct  8 13:41 mod_caucho.conf
  -rw-r--r--  9 root root  2965 Oct  8 13:41 site.conf
  -rw-r--r--  9 root root 10919 Oct  8 13:41 ssl.conf
  -rw-r--r--  9 root root 85876 Oct  8 13:41 virtual.conf

The hard link count (the second column of the file listing) is 9, which means that mirrord was last started on 2007-10-08.
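How the link count grows with each rotation can be sketched as follows. This is a minimal Python 3 illustration, not the project's rotate script: the function names `rotate` and `demo_link_count` and the directory names are hypothetical, and permissions handling and the removal of overdue archives are left out.

```python
import os
import shutil
import tempfile


def rotate(live, snapshot):
    """Move `live` to `snapshot`, then rebuild `live` from hard links."""
    shutil.move(live, snapshot)
    for root, dirs, files in os.walk(snapshot):
        rel = os.path.relpath(root, snapshot)
        d_new = os.path.normpath(os.path.join(live, rel))
        os.mkdir(d_new)
        for fname in files:
            # Both names now refer to the same inode
            os.link(os.path.join(root, fname), os.path.join(d_new, fname))


def demo_link_count(rotations=3):
    """Return the link count of an unchanged file after some rotations."""
    base = tempfile.mkdtemp()
    try:
        live = os.path.join(base, "live")
        os.makedirs(os.path.join(live, "etc"))
        path = os.path.join(live, "etc", "site.conf")
        with open(path, "w") as f:
            f.write("example\n")
        for i in range(rotations):
            rotate(live, os.path.join(base, "day-%02d" % i))
        return os.stat(path).st_nlink
    finally:
        shutil.rmtree(base)
```

In this sketch, a file that is never retransmitted gains one link per rotation, so its link count tells how many archives still share the same copy.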

Notice that fs_sync unlinks a regular file before transferring it, to avoid polluting the rotated backups, since all the rotate script actually does is create directories and make hard links to regular files.
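Why the unlink matters can be shown with a small experiment. The following is only a sketch in Python 3 and not the fs_sync code; the helper names `write_in_place`, `write_after_unlink` and `demo` are hypothetical.

```python
import os
import shutil
import tempfile


def write_in_place(path, data):
    # Truncates and rewrites the existing inode, so every hard link
    # to it (including archived ones) sees the new content
    with open(path, "w") as f:
        f.write(data)


def write_after_unlink(path, data):
    # Remove this name first, so the new content gets a fresh inode
    # and the other hard links keep the old content
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
    with open(path, "w") as f:
        f.write(data)


def demo():
    """Return what the archived link contains after each kind of write."""
    base = tempfile.mkdtemp()
    try:
        results = {}
        writers = (("in_place", write_in_place),
                   ("unlink_first", write_after_unlink))
        for name, writer in writers:
            live = os.path.join(base, name + ".live")
            snap = os.path.join(base, name + ".snapshot")
            with open(live, "w") as f:
                f.write("old")
            os.link(live, snap)  # what a rotation does for a regular file
            writer(live, "new")
            with open(snap) as f:
                results[name] = f.read()
        return results
    finally:
        shutil.rmtree(base)
```

Here the archived link of the file written in place ends up with the new content, while the one written after an unlink still holds the old content.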

So far, only the modification actions and their corresponding pathnames are transmitted, so if the mirrored directories and files have ownership or permission requirements, the current mirrord/fs_mirror cannot handle them by itself: changing ownerships and permissions requires traversing the whole file system, which consumes a lot of resources and time. I therefore modified the rotate script to reset the ownerships and permissions daily. This is not a good solution, and a more capable one will have to wait for a future version.

Since the ownerships and permissions must be kept consistent, it is sensible to use a centralized authentication solution such as NIS, LDAP or Kerberos.

I must say that so far the fs_mirror client has not been fully tested, because of my limited experience in design and programming and because I usually have to squeeze the work into my spare time, so you may have to spend some time adjusting fs_mirror before it works correctly.


Backup System for HA

The rsync part of the Overview section described a backup setup for high availability. With mirrord/fs_mirror, just replace "rsync" with "mirrord", like this:

  This graph should be displayed with monospaced fonts:
  
      +----------+  
      |  worker  | -[mirrord] -----------\
      +----------+                       |
         ......                          |
                                         |
      +----------+                       |
      |  worker  | -[mirrord] -----------\
      +----------+                       |
                                         V
                                    [fs_mirror]
                                         |
      +----------+                  +----------+
      |  worker  | -[mirrord] --->  |  backup  |
      +----------+                  +----------+
           |                             |
      [take_over]                        |
           |                             |
           V                             |
      +----------+                       |
      |  rescue  | <------------------- NFS
      +----------+

This is a many-to-one backup, which is cost efficient. If one of the worker hosts fails, you can substitute the rescue host for the failed worker with the aid of any high availability method, such as the heartbeat project of http://www.linux-ha.org/.

To make fs_mirror on the backup host mirror the file systems of several worker hosts, run one fs_mirror instance per worker, like this:

  client# fs_mirror -v -o host=worker01 -o fssync.user=mirror -o fssync.port=21 \
  	--password-from=/root/.fs_mirror/secrets:worker01 --datadir=/data/fs_mirror/worker01/
  client# fs_mirror -v -o host=worker02 -o fssync.user=mirror -o fssync.port=21 \
  	--password-from=/root/.fs_mirror/secrets:worker02 --datadir=/data/fs_mirror/worker02/

Make sure the hostnames can be resolved and that the user "mirror" exists on all the server hosts with read permission on the whole file system to be mirrored (Linux ACLs are the recommended way to achieve this).
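The invocations above differ only in the worker name, so with many workers the argument lists can be generated. The following is a minimal Python 3 sketch: the helper `fs_mirror_command` is hypothetical, and it only builds the arguments from the options shown above without starting anything.

```python
def fs_mirror_command(worker, user="mirror", port=21,
                      secrets="/root/.fs_mirror/secrets",
                      data_root="/data/fs_mirror"):
    """Build the fs_mirror argument list for one worker host."""
    return [
        "fs_mirror", "-v",
        "-o", "host=%s" % worker,
        "-o", "fssync.user=%s" % user,
        "-o", "fssync.port=%d" % port,
        # The secrets file is addressed as <file>:<worker>
        "--password-from=%s:%s" % (secrets, worker),
        # One data directory per worker keeps the mirrors apart
        "--datadir=%s/%s/" % (data_root, worker),
    ]


if __name__ == "__main__":
    for worker in ("worker01", "worker02"):
        print(" ".join(fs_mirror_command(worker)))
```

Each list could then be handed to something like `subprocess.Popen` or written into an init script, depending on how you supervise the clients.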

As described in the previous subsection, this puts the mirrored file system contents of different hosts into different directories. When taking over, you then make the appropriate symlinks to the corresponding subdirectories of the mirrored part.
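The symlink step at takeover might look like the following Python 3 sketch. It is only an illustration under assumptions: the function `link_mirrored_dirs`, the `target_root` parameter and the `var/www` subdirectory are hypothetical, and a real takeover would also have to stop or redirect the services involved.

```python
import os
import shutil
import tempfile


def link_mirrored_dirs(datadir, subdirs, target_root):
    """Create target_root/<subdir> symlinks pointing into datadir."""
    created = []
    for sub in subdirs:
        source = os.path.join(datadir, sub)
        link = os.path.join(target_root, sub)
        if not os.path.isdir(source):
            raise ValueError("not mirrored: %s" % source)
        os.makedirs(os.path.dirname(link), exist_ok=True)
        # Replace a stale symlink; a real directory at this path makes
        # os.symlink fail, so existing data is never overwritten
        if os.path.islink(link):
            os.unlink(link)
        os.symlink(source, link)
        created.append(link)
    return created


def demo():
    """Link one mirrored subdirectory inside a temporary tree."""
    base = tempfile.mkdtemp()
    try:
        datadir = os.path.join(base, "fs_mirror", "worker01")
        mirrored = os.path.join(datadir, "var", "www")
        os.makedirs(mirrored)
        rescue_root = os.path.join(base, "rescue")
        links = link_mirrored_dirs(datadir, ["var/www"], rescue_root)
        return os.path.realpath(links[0]) == os.path.realpath(mirrored)
    finally:
        shutil.rmtree(base)
```

On a real rescue host, `target_root` would be `/` and `datadir` the per-worker data directory given to fs_mirror.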

Run as an IDS

An unexpected but useful application of mirrord/fs_mirror is to run them as an IDS (Intrusion Detection System) using the "REPORT" fs_sync mechanism, which can be regarded as a realtime tripwire. On the server side, put the parts of the system that do not change frequently, such as the executable system commands and the configuration files, under the monitoring of mirrord (inotify). On the client side, run fs_mirror in "REPORT" mode. Any critical modification on the server is then reported in near realtime.
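What a client could do with the reported changes can be sketched as a simple filter. This Python 3 example is an assumption-laden illustration, not the "REPORT" implementation: the record format (an action name plus a pathname), the action names, the prefix list and the functions `is_critical` and `report` are all hypothetical.

```python
import posixpath

# Hypothetical list of rarely changing system paths worth alerting on
CRITICAL_PREFIXES = ("/bin", "/sbin", "/usr/bin", "/usr/sbin", "/etc")


def is_critical(pathname, prefixes=CRITICAL_PREFIXES):
    """Tell whether a pathname lies at or below one of the prefixes."""
    path = posixpath.normpath(pathname)
    return any(path == p or path.startswith(p + "/") for p in prefixes)


def report(records, prefixes=CRITICAL_PREFIXES):
    """Turn (action, pathname) records into alert lines."""
    alerts = []
    for action, pathname in records:
        if is_critical(pathname, prefixes):
            alerts.append("ALERT %s %s"
                          % (action, posixpath.normpath(pathname)))
    return alerts


if __name__ == "__main__":
    sample = [("WRITE", "/etc/passwd"), ("CREATE", "/tmp/scratch")]
    for line in report(sample):
        print(line)
```

The pathnames are normalized before matching so that a path such as `/usr/bin/../bin/ls` cannot slip past the prefix check.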

It seems a good idea to implement a GUI application for the IDS client, so that it can run as a daemon on your PC, for example on a Windows system, and report critical changes on the servers quickly.

So far this usage has not been thought through thoroughly, and some important features such as SSL support have not been implemented, but I am sure it can be used in this way.

Acknowledgment

Flaboy #王磊: he was the first person to propose using FAM on Linux for near realtime file system mirroring, and he conceived the first story. Following his suggestion, I searched the Internet for a whole evening and finally found inotify.

Alex #老徐(徐唤春): he suggested using a log mechanism to record the modifications of the file system, taking MySQL's replication as a reference. We also discussed some aspects of the protocol to be used between mirrord and fs_mirror.

Sebastien Martini: I use his pyinotify module, and his feedback has been very valuable to me.

Control Center

Online System Adjustment: Migration and Upgrade

Cluster, Distributed System and Virtual Hosts

Some Other Subsystems and Utilities

New Linux Distribution with Templates of Solutions

Unit Test for every Subsystem?

Connect the Architecture Units All Around the World

Commercial Mode?