The project Page is: [https://sourceforge.net/projects/crablfs/]
and the svn repository is: [https://crablfs.svn.sourceforge.net/svnroot/crablfs]
The original purpose of this project was to provide a package manager for the LFS distribution (the User Based Package Management System). It uses Linux's basic user/group mechanism to distinguish packages (I may add an LD_PRELOAD mechanism later), and it can rebuild a customized LFS system automatically, again and again, by compiling the source code.
Furthermore, I found that this is really a problem of reusability and manageability, especially when hundreds or thousands of machines have to be managed. There is a lot of duplication in the process of system administration, so I changed the direction of the project.
The project now has 3 subprojects: caxes, cutils and ulfs. You can find them in the subdirectories of the SVN repository.
caxes is the base of the other subprojects. It includes several
libraries and tools that improve the reusability of system administration and
network management, and thus raise the level of automation.
One aspect of reusability is configuration syntax reuse, so caxes contains several modules that parse trees, tables and sequences from plain text files, XML files or LDAP/SQL databases.
To do this more gracefully, several new data structures are being defined in Python, including Tree and Table. So far the most basic structure of Tree has been created; you can find its documentation in the code of tree.py and its unit test test_tree.py, or look here:
Another aspect is a configuration sharing mechanism among dozens to hundreds
of hosts, which makes rebuilding a system more convenient and more
automatic. A tool named ctemplates, which works with configuration
templates, will be created to achieve this.
cutils is a set of tools that depend on caxes and handle many system
administration tasks and daily Linux use, including file system backup,
protocol-independent file transport, Gmail backup, etc.
Currently I'm working on fs_backup and mirrord/fs_mirror in cutils.
fs_backup is a tool that backs up the file system with identities, to improve
the manageability of backups. mirrord/fs_mirror is an agent/client
application that synchronizes the file systems of more than 2 hosts in
near real time with the aid of inotify in the Linux kernel, so you can
use it to build a low-cost hot backup (near-line) system. For example,
in a group of 7 hosts, 5 hosts are online systems running mirrord, 1 host is
the hot backup system running fs_mirror, and 1 host is the substitute (rescue
host) that takes over when one of the 5 online hosts crashes.
With the aid of the fs_info module, also included in cutils, you can
back up or mirror only the specified parts of the file system that contain key
data, and exclude some subdirectories of those parts, because fs_info provides
this functionality.
For more details, please just jump to the Chapter:
And its Chinese translation.
The ulfs subproject is the package management system mentioned above. Since
it records the compilation and installation commands automatically,
those profiles can be used as configuration by caxes and shared by
ctemplates, which makes the build process reusable not only for
yourself but for anyone who needs it.
This document can be considered a preliminary design, and can eventually be
extended into a user manual. The source of this document is also in the
'doc' subdirectory of the SVN repository, written in 'txt2tags'. The detailed
design docs are in the program source code (Python __doc__) and the unit tests.
This document is mainly about the reusability of computer systems, especially for large numbers of computers. I will only discuss the aspects that affect reusability and introduce only a basic concept of an architecture for system administration; most other aspects of the architecture already have existing solutions and are outside the scope of this document.
So far there is only an English version of this document; I will write a Chinese version when I have time.
There is also a document with some of my system administration notes and experiences, in Chinese, at:
If I have time, I will translate it into English.
Thanks.
Before we talk about the reusability of the system environment, we must look at the system architecture. The system here is not a single computer; it is a system with dozens to hundreds of hosts in its environment.
Then what is architecture? Why?
Today's services are becoming more and more complicated. They usually involve many machines, sometimes clusters and distributed systems; more and more types of service are offered, and new tools and applications are invented every day. So how do we manage all of these things?
A compact system is more than an accumulation of hardware and hosts, so we need a center to control the whole system, just like a head. It seems impossible to complete all the management tasks on a single machine, and even if you could, it would not be a good idea, because it would eventually become impossible.
So we must split the tasks so that different hosts carry them out. These hosts eventually form a relatively small network for administration, which is the infrastructure of the whole system, just like a head with left and right brains, eyes, ears...
The systems that handle these different tasks can be called subsystems. We can classify them like this:
** this graph should be shown with monospaced font **
<--- VPN implementation --->
+----------------------+
| security system |
+----------------------+
+----------------------+ +-----------------+
| authentication | | document center |
| && authorization | +-----------------+
+-----------+ +----------------------+
| service 1 |
+-----------+ +----------------------+
| monitor && |
+-----------+ | logs analysis |
| service 2 | +----------------------+
+-----------+
... ########################
+-----------+ # configuration && #\
| cluster 1 | # control center # \------------\
+-----------+ ######################## |
| |
| |
V V
######################## ####################
# storage & backup sys #-----# revision storage #
######################## | ####################
| |
| | %%%%%%%%%%%%%%%%%%%%
| \ % packages %
| \-% collection %
V %%%%%%%%%%%%%%%%%%%%
+----------------------+
| backup of backups |
| on Mars |
+----------------------+
In this graph, the subsystems in the center and on the right are the infrastructure, and those enclosed in '#' or '%' are the key points for improving reusability, which I will discuss later. The components on the left are part of the superstructure, which is outside the range of my discussion.
This graph does not show the details. In fact, every infrastructure subsystem has logical connections to the other subsystems and to the superstructure. For example, the configuration center needs backup, revision control and package management (the purpose of the packages collection), and the document center and the monitor need those too! Details are in the next chapters.
Now we can discuss each subsystem in several segments:
There is a very important rule in development: DRY (Don't Repeat Yourself). A programmer can achieve this fairly easily because there is only one application to think about. A system administrator, however, has to take into account a great deal of data duplication between various applications and hosts.
To describe the data duplications, I must explain my view of the data in a computer. I think there are 3 types of data:
The data of the software environment is mainly application packages (and their documentation); this is the most static data on the computer.
Configuration data is mostly the settings in configuration files; sometimes these settings are in databases. It is more dynamic than the software environment but more static than runtime data. A very important characteristic of configuration is readability: it is a type of User Interface (UI).
So the code (software environment) decides what the program can do, while the configuration decides what the program should do. The runtime data is the object the program works on.
The runtime data is mostly in relational databases. It is totally dynamic, and direct readability is not necessary -- there are tools to access this data.
There are many Linux and BSD distributions, and each has its own binary package manager. To avoid this type of duplication, we had better choose one of them for our architecture environment to reduce the cost.
But binary packages have vital limitations: they are platform dependent, and their features are fixed and cannot be changed. For example, the mutt RPM of Fedora has no libesmtp support; if I need it, I still have to compile the libesmtp and mutt sources to add this feature. If 100 hosts need this modification, what then? Especially when these 100 hosts span several platforms, such as x86, amd64, ppc, etc., or several operating systems or distributions: Debian/Ubuntu, Red Hat/Fedora, Gentoo, LFS, FreeBSD ... After all, in this age of individuality, it is not wise to restrict the ways and habits in which people use their computers.
So as you can see, this leads to duplications.
Red Hat/Fedora of course has SRPM, which is also a form of source compilation control. So why not build the whole system from source directly? In fact, this is the spirit of UNIX and Open Source.
There are several source package managers, such as the simple paco, the more complicated Red Hat/Fedora SRPM, or the FreeBSD ports and Gentoo Portage systems.
The ports system may be a good idea, but ultimately the user's customizations are necessary, so how can these customizations be made reusable?
Red Hat/Fedora SRPM or Debian/Ubuntu dpkg-source are also good choices, but I would rather use the package manager I have written, especially for the needs of LFS (Linux From Scratch). I will discuss all of them in later chapters, along with the installation templates based on them for reusability.
There are two types of configuration duplication. One is duplication between different applications; sometimes the duplication even appears in different settings of one application.
For example, many applications need some IP address or hostname information, and you have to type these IP addresses into different configuration files in different formats. When an IP address changes, you have to change all of those files, and there is no guarantee that you will not forget one.
You put the /dev/input/mice setting into /etc/X11/xorg.conf for the Xorg mouse, and then you still have to set the same value for gpm. Changing one will not automatically change the other.
...
I think the Apache's httpd.conf is also a good example:
You have defined
DocumentRoot /var/www/html
, but you still have to write
<Directory /var/www/html>
rather than
<Directory $DocumentRoot>
If you set Apache's log to logs/access_log, you also have to manually edit
/etc/logrotate.d/httpd and add the path name of the log file. Again,
changing one will not automatically change the other.
Write scripts to make the changes automatically? Maybe. But the formats are so different that you have to write a lot of scripts, and keeping all these scripts consistent creates duplication of its own, which becomes a nightmare.
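As a rough illustration of the alternative (write each shared value down once and render every file from it), here is a minimal sketch in Python 3. The setting names and both templates are assumptions invented for this example; they are not part of caxes or ctemplates.

```python
from string import Template

# Hypothetical shared settings: each value is written down exactly once.
settings = {
    "document_root": "/var/www/html",
    "access_log": "/var/log/httpd/access_log",
}

# Illustrative fragments of the two files that would otherwise repeat the values.
HTTPD_TEMPLATE = Template(
    "DocumentRoot $document_root\n"
    "<Directory $document_root>\n"
    "</Directory>\n"
    "CustomLog $access_log combined\n"
)
LOGROTATE_TEMPLATE = Template(
    "$access_log {\n"
    "    weekly\n"
    "}\n"
)

def render_all(values):
    """Render every template from the same dictionary of values."""
    return {
        "httpd.conf": HTTPD_TEMPLATE.substitute(values),
        "logrotate.d/httpd": LOGROTATE_TEMPLATE.substitute(values),
    }

rendered = render_all(settings)
print(rendered["httpd.conf"])
print(rendered["logrotate.d/httpd"])
```

Changing document_root or access_log in the dictionary changes both rendered files at once, which is the property the hand-maintained files lack.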
The other kind of configuration duplication is between hosts. You always have to log in to every host to modify its configuration.
Maybe we can make use of configuration sharing, but so far I have not seen a solution that satisfies me, especially regarding the single point of failure that configuration sharing introduces. Another problem is that most of the configuration is the same across hosts but a small part is not, and this may prevent whole files from being shared because there is no import or include mechanism.
I have met this situation: two departments maintain two systems whose settings, stored in databases, are similar but differ in format and in some details. Leaving the political factors aside, the inconsistency between the systems during updates causes conflicts between the departments.
Text is good for configuration, but sometimes editing a text file is not so convenient, because your eyes and hands have to spend time searching through the context.
Yes, many settings are simple to modify, but if hundreds or thousands of things have to be changed under these conditions, it is not simple.
More examples of services configuration duplications ...
More examples of hosts configuration duplications ...
The previous duplications can be considered duplications in space. There is also duplication in time: you edit a configuration file, and several days or weeks later you may forget what you modified. If there are several administrators, how do you know who made which modification?
These problems can be solved by revision control, but until the Configuration Sharing Mechanism for the previous duplication problems is built, this goal is not easy to achieve.
Other duplications in the systems mostly come from a poorly formed architecture or from poor utilities.
For example, the lack of a central authentication and authorization mechanism leads to a mass of separate auth mechanisms for different services, and a member may have to remember several accounts (more for system administrators). Of course you can create the same account for one person in different services, but it is a nightmare to modify the passwords or account information across all the services and hosts.
If there is an authentication and authorization (aa) center, it is very easy to disable an account: just LOCK it in the authentication center (like 'usermod -L' on a single system), and the authorization problems elsewhere can be ignored.
Some people treat a copy or replication as a backup; I think that is wrong. If the backup tools do not take the restore/recovery process into account and have no good management organization, you will have to write many unmanageable scripts for backup and do a lot of manual work during a restore.
We talked about documents. In the same way, we can remove most duplication in document management: just choose the right markup language, write the documents with the aid of revision control, and convert between formats automatically -- there are many tools for this.
In a system environment with low reusability, a system administrator may have to pay too much attention to details and copies. There is a proverb: the devil is in the details. I think the devil is also in the copies, because copying is mostly the responsibility of machines, not humans. If the copying has to be done manually, it is a nightmare.
To make such a system architecture a compact system, building all the subsystems mentioned before becomes a toolchain problem. The key is the circular dependencies among the different subsystems, which are therefore closely connected, just like building an LFS system, in which all the packages are compiled from source. The building process becomes complicated, and it is inefficient to build it manually every time; that is the purpose of reusability.
During maintenance there are also many duplications, which we have discussed; to solve those problems, reusability is the key.
So to make this big system run effectively, reusability becomes a very important measure and foundation.
Reusability will also improve the orthogonality of the whole system, by designing (1) the new syntax (mini-language) of configuration and the corresponding APIs, and (2) the new configuration sharing mechanism with the aid of revision control.
After an architecture has been built with the aid of the reusability mechanism, everything related to it will be recorded, including package information and configuration settings. All of this becomes a template.
Other users can build new architectures based on these templates, adjust them to their own needs, and then generate new templates. These new templates can also be used by others ...
So in the end, we can build a new Linux distribution from templates of solutions, or improve the efficiency of the existing distributions, and get more and more choices more easily ... Maybe a whole new world?
So what I have in mind is the foundation of the whole infrastructure of a system architecture.
Data-driven design is a good approach: usually the code decides what a program can do, and the configuration decides what a program should do. So our problem now becomes: what is the most logical data structure for the settings of a program?
I have said that there are hundreds of configuration formats for maybe thousands of applications, which leads to duplication among different services and hosts. But we can still find similarities among them. In my opinion, there are 3 types of configuration format: Tree (key-value), Table and Sequence List.
Let's have a look at them:
my_hdr From: chowroc.z@gmail.com
set edit_headers=yes
subscribe
set pop_user="chowroc.z@gmail.com"
set pop_pass=********
set pop_host="pops://pop.gmail.com:995"
set smtp_host="smtp.gmail.com"
set smtp_port="587"
set smtp_auth_username="chowroc.z@gmail.com"
set smtp_auth_password="********"
set smtp_use_tls_if_avial=yes
......

xferlog_enable=YES
xferlog_file=/var/log/vsftpd.log
local_umask=002
user_config_dir=/etc/vsftpd/users
chroot_local_user=YES
userlist_deny=NO
ssl_enable=YES
force_local_logins_ssl=YES
......

fs.quota.syncs = 14
kernel.printk_ratelimit = 5
kernel.modprobe = /sbin/modprobe
kernel.panic = 0
vm.max_map_count = 65536
vm.swappiness = 60
net.ipv4.tcp_syncookies = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv6.route.max_size = 4096
dev.scsi.logging_level = 0
......

myhostname = mail.sample.com
mydomain = sample.com
myorigin = $mydomain
mynetworks = 127.0.0.0/8 192.168.0.0/24
mydestination = $domain
smtpd_sasl_auth_enable = yes
smtpd_recipient_restrictions = permit_mynetworks, permit_sasl_authenticated, reject
smtpd_client_restrictions =
smtp_always_send_ehlo =
smtp_connect_timeout =
smtp_connection_cache_on_demand =
smtp_connection_cache_reuse_limit =
smtp_connection_cache_time_limit =
smtp_connection_reuse_time_limit =
lmtp_sasl_auth_enable = yes
local_destination_concurrency_limit =
local_destination_recipient_limit =
......

The sysctl.conf is the most classic plain text tree format; the postfix main.cf is the same, only with a different delimiter between nodes. The simple key-value pair format can also be considered a special case of the plain text tree, with a depth of only 1.
Since configuration file formats are regular, the goal can now be stated as: choose the right formats and design the corresponding APIs.
This format design can also be considered a mini-language design. The mini-language only describes the configuration data structure, not a procedure -- the interpretation is performed by the API and the application itself.
As we can see, the tree is the most flexible type and the most widely applied. It is very suitable when the meaning of the settings varies line by line; this type of configuration usually does not contain a very large amount of information.
So my design must center on the tree format, which forms the trunk of the configuration system (for now we only discuss a single computer system, to make things easier to understand). When necessary, the branches and leaves can point to tables and lists.
Among the several styles of tree format, which is the best choice? The very popular XML, for example? Or LDAP for all configuration -- it seems simple and straightforward.
But in my opinion, if we use files for configuration, the best format is the 'Plain Text Tree' as in sysctl.conf, also called Flat Tree or Line Based Tree. It is the clearest and most intuitive style, with the highest readability. The readability of XML is much lower, with <tags> everywhere in the file.
Choosing only LDAP makes the whole system too rigid, and its flexibility and readability are poor. The configuration would be too centralized, which easily leads to a 'single point of failure'.
The plain text tree is very easy to process with traditional UNIX tools such as grep, sed, awk, etc. Since the goal is the design of the APIs, the parameters that I pass to the interface when I import the module need to be as simple as possible so that people can remember them; a line-based string is a good idea, rather than XML/LDAP description blocks. For example:
from caxes.ctree import CTree
cfile = '/etc/fs_backup.conf'
cto = CTree(...)
config = cto.get(cfile, 'backup.datadir')
...
config = {}
config = cto.get(cfile, 'restore.*')
...
config = cto.get(cfile, 'upm.install.commands.*')
...
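The lookups above can be approximated in a few lines of Python 3. This is only a sketch of the idea, not the caxes implementation: parse_plain_tree and get are hypothetical helper names, the sample file content is made up, and quoting, trailing ';' and multi-line values are ignored.

```python
import fnmatch

def parse_plain_tree(text):
    """Parse 'a.b.c = value' lines into a flat {dotted key: value} dict."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a 'key = value' line
        options[key.strip()] = value.strip()
    return options

def get(options, pattern):
    """Return the options whose dotted key matches an exact name or a wildcard."""
    return {k: v for k, v in options.items() if fnmatch.fnmatchcase(k, pattern)}

# Hypothetical content of a small configuration file.
sample = """
# fs_backup settings
backup.datadir = /var/task/fs_backup
restore.target = /mnt/restore
restore.verify = yes
"""
options = parse_plain_tree(sample)
print(get(options, "backup.datadir"))
print(get(options, "restore.*"))
```

Because every option is one line with its full dotted path, the same file stays friendly to grep, sed and awk, and a line-oriented diff in revision control shows exactly which option changed.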
For better manageability, it is a good idea to put the configurations under revision control. XML lowers the manageability because it is not line based, and I don't know how to apply revision control to LDAP. So the plain text tree is the first choice in this field.
Generally speaking, the plain text tree is the simplest format for end users; after all, configuration is not a document.
But these line-based strings can also be converted to XML/LDAP blocks by appropriate tools or APIs, and some people may prefer XML to this line string style, so we can also design the corresponding APIs and make conversion among all these formats easy and automatic.
In this way, the information can be stored in a plain text tree, XML, LDAP or any other storage, but the APIs presented to end users are consistent: the user only needs to pass a simple line string to the methods of the APIs.
A tree cannot describe the whole world, so leaf nodes can point to tables and lists. For example, a plain text tree file can be:
backup.records = 'ctable:///var/fs_backup/.table'
backup.ids = 'clist:///var/fs_backup/id_table'
Then the API knows to retrieve information from the given table and list, and makes them easy to use for the program that calls it.
When a table is used as configuration, text style is the first choice, but the table format should differ a little from classical table files such as /etc/passwd or CSV. It should have some 'self-explanatory' characteristics, such as the field names and the separator, so that callers of the API don't need to specify them.
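To make 'self-explanatory' concrete, here is a small Python 3 sketch. The header layout (a first line of the form "#! sep=<char> fields=<names>") is purely an assumption for illustration; the document does not define the actual table format, and parse_text_table is a hypothetical name.

```python
def parse_text_table(text):
    """Parse a text table whose first line declares its separator and field names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # Assumed header layout: "#! sep=<char> fields=<comma separated names>"
    meta = dict(item.split("=", 1) for item in header.lstrip("#! ").split())
    sep = meta["sep"]
    fields = meta["fields"].split(",")
    # Each data row becomes a dict keyed by the declared field names.
    return [dict(zip(fields, row.split(sep))) for row in rows]

# Made-up table content using some of the fs_backup field names.
sample = """#! sep=: fields=bid,btime,type
1:2007-01-01:full
2:2007-01-02:incr
"""
records = parse_text_table(sample)
print(records[1]["type"])
```

The caller never passes the separator or the column names; both travel with the file, which is the difference from /etc/passwd or plain CSV.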
It's better if the API can also access the text table using standard SQL. There are standard documents for this type of API; for example, Python has the Python Database API Specification v2.0. Thus we don't need to design a new mini-language for the table format.
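One cheap way to get SQL over such a table, sketched here in Python 3, is to load the parsed rows into an in-memory SQLite database, since the standard sqlite3 module follows DB-API 2.0. This is a stand-in for the idea, not the planned text table API; the table name, field names and rows are made up.

```python
import sqlite3

# Hypothetical rows as they might come out of a parsed text table.
fields = ["bid", "btime", "type"]
rows = [("1", "2007-01-01", "full"), ("2", "2007-01-02", "incr")]

conn = sqlite3.connect(":memory:")  # nothing is written to disk
cur = conn.cursor()
cur.execute("CREATE TABLE fs_backup (%s)" % ", ".join(fields))
cur.executemany("INSERT INTO fs_backup VALUES (?, ?, ?)", rows)

# Ordinary SQL with DB-API parameter binding.
cur.execute("SELECT bid, btime FROM fs_backup WHERE type = ?", ("full",))
full_backups = cur.fetchall()
print(full_backups)
conn.close()
```

A real DBTable backend could then expose the same cursor interface, so the calling program would not need to know whether the data lives in a text file or in a database server.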
Then we can also use real DBs as auxiliaries, just as LDAP is for the plain text tree.
The list is similar to the table, and simpler.
The organization of the format structures can be viewed like this:
** this graph should be shown with monospaced font **
CTree(Ftree(Plain Text Tree), XTree(XML), LTree(LDAP))
|
|\--CTable(FTable(Text Table), DBTable(real DBs))
|
|\--CList(Flist, ...)
You can also have a look at the Big Picture for more details.
The previous section discussed some views of the APIs and the mini-language; now let's go further.
A basic plain text tree looks like this:
fs_backup.format = "bid, btime, archive, type, list, host";
fs_backup.type.default = "full";
The corresponding API should look like this:
from caxes.ctree import CTree
cfile = open('file:///etc/fs_backup.conf', 'r')
ctr = CTree()
config = ctr.get(cfile, optmap=['fs_backup.format'])
print config
# {'fs_backup.format' : 'bid, btime, archive, type, list, host'}
config = ctr.get(cfile, prefix='fs_backup', optmap=['format', 'type.default'])
print config
# {'format' : 'bid, btime, archive, type, list, host', 'type.default' : 'full'}
But it's better to define a Tree class to make the operations more intuitive:
class Tree:
......
# now CTree.get() will return a Tree instance, so
config = ctr.get(cfile, optmap=['fs_backup.format'])
dir(config)
# [...,
# _Tree__node_value, _Tree__node_items,
# __setattr__, __setitem__, __getitem__, __add__, __iadd__
# __traverse__, __update__, _one_node_set,
# fs_backup,
# ...]
print config.__prefix__
# ''
repr(config.fs_backup)
# <Tree instance at ...>
dir(config.fs_backup)
# [..., _Tree__node_value, _Tree__node_items, __str__, format, type, ...]
print config.fs_backup.type()
# 'increment'
print config.fs_backup.type.__dict__
# { 'default' : <Tree instance at ...>, ...}
print config.fs_backup.type.default()
# 'full'
config = ctr.get(cfile, prefix='fs_backup', optmap=['format', 'type'])
print config.__dict__
# { 'format' : <Tree instance at ...>, 'type' : <Tree instance at ...>, ...}
print config.format()
# 'bid, btime, archive, type, list, host'
print config.type()
# 'increment'
print config.type.default()
# 'full'
In fact this creates a wholly new data structure in Python, Tree, that acts like a built-in type. The details of the Tree design can be found in the section: New Tree data structure.
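The attribute-style access shown above can be sketched in Python 3 as follows. This is a minimal illustration under my own assumptions, not the tree.py in the repository: the set() helper and the _value attribute are invented for the example, and a child literally named 'set' would collide with the method.

```python
class Tree:
    """A node holding an optional value and named child nodes (a sketch)."""

    def __init__(self, value=None):
        # Stored under a private name so it cannot clash with child names.
        self.__dict__["_value"] = value

    def __getattr__(self, name):
        # Called only when normal lookup fails: create missing children on demand.
        if name.startswith("_"):
            raise AttributeError(name)
        child = Tree()
        self.__dict__[name] = child
        return child

    def __call__(self):
        """Return the value stored on this node, as in config.type.default()."""
        return self._value

    def set(self, dotted_key, value):
        """Walk (and create) the path named by 'a.b.c' and store value at its end."""
        node = self
        for part in dotted_key.split("."):
            node = getattr(node, part)
        node.__dict__["_value"] = value

config = Tree()
config.set("fs_backup.format", "bid, btime, archive, type, list, host")
config.set("fs_backup.type", "incr")
config.set("fs_backup.type.default", "full")
print(config.fs_backup.type())
print(config.fs_backup.type.default())
```

Calling a node returns its own value while attribute access descends to children, which is what lets one node such as fs_backup.type carry a value and still have a child named default.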
Furthermore, the Tree structure can be used directly as a module by Python programs and scripts, so a configuration can be written as a Python module rather than in the new mini-language described here, which eliminates the text parsing process as far as possible.
The tree configuration must also support the normal comment style: '#'
fs_backup.default = "full"; # comment
# more comment
and so must multi-line mode, for example:
test.long.value = """The value of the options
also support
multilines mode""";
There should be variable support:
fs_backup.type.default = "full"; # comment
fs_backup.type = ${fs_backup.type.default};
# fs_backup.type = "incr";
fs_backup.datadir = "/var/task/fs_backup";
fs_backup.tagfile = ctable:"$datadir/.tag";
# Or use absolute path:
# fs_backup.tagfile = ctable:"${fs_backup.datadir}/.tag";
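A minimal Python 3 sketch of how the absolute ${...} form could be expanded is given below. It works on a flat dict of already parsed options and deliberately ignores the relative $datadir form and the ctable: prefix; the function name expand and the cycle check are my own assumptions, not the caxes implementation.

```python
import re

# Matches ${some.dotted.key} and captures the key.
VAR = re.compile(r"\$\{([^}]+)\}")

def expand(options, key, _seen=()):
    """Return options[key] with every ${other.key} replaced by its expanded value."""
    if key in _seen:
        raise ValueError("circular reference: " + " -> ".join(_seen + (key,)))
    return VAR.sub(lambda m: expand(options, m.group(1), _seen + (key,)),
                   options[key])

options = {
    "fs_backup.type.default": "full",
    "fs_backup.type": "${fs_backup.type.default}",
    "fs_backup.datadir": "/var/task/fs_backup",
    "fs_backup.tagfile": "${fs_backup.datadir}/.tag",
}
print(expand(options, "fs_backup.type"))
print(expand(options, "fs_backup.tagfile"))
```

Expanding lazily at lookup time, rather than while parsing, means the order of lines in the file does not matter, and a reference loop is reported instead of recursing forever.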
Sometimes wildcard will be very convenient:
config = ctr.get(cfile, optmap=['fs_backup.*'])
# as a dict:
print config
# {'fs_backup.format' : '...', 'fs_backup.type.default' : 'full', 'fs_backup.type' : 'incr', 'fs_backup.datadir' : '/var/task/fs_backup'}
# or as a Tree instance:
dir(config.fs_backup)
# [..., __str__, format, type, datadir, tagfile, ...]
Furthermore, regular expression support can be added too.
Since we have talked about the format structure and organization, support for cross-file parsing should also be added.
fs_backup.others1 = "ftree:///etc/others1.conf";
fs_backup.others2 = "ldap://hostname/...";
fs_backup.others3 = "xml:///etc/others2.xml";
fs_backup.others4 = "ftree:///etc/others4.conf/root.branch1.leaf2"
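How such references might be taken apart is sketched below in Python 3. Both helper names are hypothetical, and the rule for separating the file from the node path (the file name ends in '.conf') is only an assumption; the document does not say how the two are distinguished.

```python
def split_reference(value):
    """Split 'scheme://rest' into (scheme, rest); plain values get scheme None."""
    scheme, sep, rest = value.partition("://")
    if not sep:
        return None, value
    return scheme, rest

def split_ftree_target(rest, suffix=".conf"):
    """Assumed convention: the file name ends with suffix; the remainder is a node path."""
    head, sep, tail = rest.partition(suffix + "/")
    if not sep:
        return rest, None  # the reference names a whole file
    return head + suffix, tail

scheme, rest = split_reference("ftree:///etc/others4.conf/root.branch1.leaf2")
print(scheme, split_ftree_target(rest))
```

A dispatcher could then hand the rest of the reference to the parser matching the scheme (FTree for ftree, XTree for xml, LTree for ldap) and return the result as a subtree of the referring node.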
The documentation of "Tree" is in the code and unit tests, which can be checked out from the SVN repository: https://crablfs.svn.sourceforge.net/svnroot/crablfs/caxes/trunk/lib/ The files are tree.py and tree_ut.py
SELECT * FROM __CURRENT__ WHERE ...
class textdb:
For Tree: SELECT d2.value FROM foo.d1
** this graph should be shown with monospaced font **
[meta-conf]
|
V
[subversion]
|
(sandbox)
|
/--------/--------/ \------\
| | | |
V V V V
{API} {API} {API} {API}---\
| | | |
V V V V
%task1% %task2% %task3% %cluster%
|
.......
data flow view
** this graph should be shown with monospaced font **
/text/
[service]
{API}
(directory)
%program%
<=manual-operation=
$DB$
/<---------%convert%--------->/ftree/ <=modify=
| |
[LDAP]------|<--%convert%-->/xml/<=modify= |
| | |
%replicate% | |
| \--->[SVN]<---/
V |
[LDAP] |
| v
| /---(sandbox)---\
| | |
{LTree} V V
| /xml/ /ftree/~~~point~~~~~~~~~>$MySQL$
| | | | |
| V V \~~~/ftable/ |
| {XTree} {FTree} | |
| | | | |
\------------------------> \--->{CTree}<---/ {FTable} {DBTable}
| | |
V | |
%application% <---{CTable}-------------/
In the previous chapter we talked about improving reusability through configuration syntax design. That is mainly about reuse among different services, especially within a single computer system. But with the aid of a network file system and the "variable support", "cross file parsing" and "include/import" features, the configuration can also be shared among different hosts. As you can see, "variable support" and "cross file parsing" are line based, so they give fine-grained (or say: high-level) control.
But syntax design alone is not enough for reusing configuration among different hosts, because:
So the Configuration Sharing Mechanism is about file sharing, or rather, it is "based on file sharing".
This configuration sharing mechanism is not so easy to explain, but let's try:
A very simple and direct line of thought is to build a network file system, such as NFS/Samba, put the configuration files on it and share them with the hosts. But there are several problems:
Yes, if a cluster is used, reusability among different hosts can be improved remarkably. For example, with LVS and GFS, the configuration files can be put into the shared storage (GFS), and all the real servers are identical and read the shared configuration files from that single place.
But the real world is always more complex, and a cluster alone can't eliminate the problems mentioned above, especially the second and the third. Furthermore, this sharing is only oriented to service configuration files, not whole hosts, and there are some other, less important issues to consider:
Now let me explain the principles of this configuration sharing mechanism. Of course there should be a shared storage, and we also need revision control for the configuration files, for which we can choose Subversion. The big picture looks like this:
** this graph should be shown with monospaced fonts **
+-------------+
/ a real host /
+-------------+
A
| +--------------+
| / shared files /
implement +--------------+
| unfold __/ \__
| __/ \__ commit
+-----|------+ __/ \__ +------------+
/ real files /<--/ \-->/ subversion /
+------------+ +------------+
Let me first explain the structure of the shared files. I think it's necessary to treat every service as a logical host:
sh$ ls templates/*
templates/base
templates/httpd
templates/vsftpd
...
templates/webserver1 -> LAMP
templates/ftpserver1 -> vsftpd
...
templates/LAMP
templates/docs
templates/monitor
templates/logfilter
...
sh$ ls templates/httpd/*
templates/httpd/.inherits
templates/httpd/etc/ld.so.conf -> ../../base/etc/ld.so.conf
templates/httpd/etc/skel -> ../../base/etc/skel
templates/httpd/etc/rc.d -> ../../base/etc/rc.d
templates/httpd/etc/snmp -> ../../base/etc/snmp
templates/httpd/etc/logrotate.d/httpd
templates/httpd/hosts.deny
templates/httpd/etc/httpd
...
sh$ ls templates/LAMP/*
templates/LAMP/.inherits
templates/LAMP/etc/ld.so.conf -> ../../httpd/etc/ld.so.conf
templates/LAMP/etc/skel -> ../../httpd/etc/skel
templates/LAMP/etc/rc.d -> ../../httpd/etc/rc.d
templates/LAMP/etc/snmp -> ../../httpd/etc/snmp
templates/LAMP/etc/httpd -> ../../httpd/etc/httpd
templates/LAMP/etc/vsftpd -> ../../vsftpd/etc/vsftpd
templates/LAMP/etc/my.cnf -> ../../MySQL/etc/my.cnf
templates/LAMP/usr/local/lib/php.ini -> ../../../../httpd/usr/local/lib/php.ini
...
sh$ ls templates/docs/*
templates/docs/.inherits
templates/docs/etc/ld.so.conf -> ../../LAMP/etc/ld.so.conf
templates/docs/etc/skel -> ../../LAMP/etc/rc.d
templates/docs/etc/snmp -> ../../LAMP/etc/snmp
templates/docs/etc/httpd
templates/docs/etc/httpd/conf -> ../../../LAMP/etc/httpd/conf
templates/docs/etc/httpd/conf.d/sites.conf
templates/docs/etc/httpd/conf.d/svn.conf
templates/docs/etc/vsftpd -> ../../LAMP/etc/vsftpd
templates/docs/etc/my.cnf -> ../../LAMP/etc/my.cnf
templates/docs/usr/local/lib/php.ini -> ../../../../LAMP/usr/local/lib/php.ini
templates/docs/usr/share/vim/vim70/syntax/txt2tags.vim
templates/docs/usr/share/vim/vim70/filetype.vim
...
sh$ ls templates/monitor/*
templates/monitor/.inherits
templates/monitor/etc/ld.so.conf -> ../../LAMP/etc/ld.so.conf
templates/monitor/etc/skel -> ../../LAMP/etc/rc.d
templates/monitor/etc/snmp -> ../../LAMP/etc/snmp
templates/monitor/etc/httpd
templates/monitor/etc/httpd/conf -> ../../../LAMP/etc/httpd/conf
templates/monitor/etc/httpd/conf.d/virtuals.conf
templates/monitor/etc/vsftpd -> ../../LAMP/etc/vsftpd
templates/monitor/etc/my.cnf -> ../../LAMP/etc/my.cnf
templates/monitor/usr/local/lib/php.ini -> ../../../../LAMP/usr/local/lib/php.ini
templates/monitor/etc/rrdtool
templates/monitor/var/htdocs/cacti
templates/monitor/etc/postfix
...
sh$ ls hosts/*
hosts/www1 -> ../../webserver1
hosts/www2
sh$ ls hosts/www2/*
hosts/www2/.inherits
hosts/www2/etc/ld.so.conf -> ../../../../templates/docs/etc/ld.so.conf
hosts/www2/etc/skel -> ../../../../templates/docs/etc/rc.d
hosts/www2/etc/snmp -> ../../../../templates/docs/etc/snmp
hosts/www2/etc/httpd
hosts/www2/etc/httpd/conf -> ../../../../../templates/docs/etc/httpd/conf
hosts/www2/etc/httpd/conf.d
hosts/www2/etc/httpd/conf.d/sites.conf -> ../../../../../../templates/docs/etc/httpd/conf.d/sites.conf
hosts/www2/etc/httpd/conf.d/svn.conf -> ../../../../../../templates/docs/etc/httpd/conf.d/svn.conf
hosts/www2/etc/httpd/conf.d/virtuals.conf -> ../../../../../../templates/monitor/etc/httpd/conf.d/virtuals.conf
hosts/www2/etc/vsftpd -> ../../../../templates/docs/etc/vsftpd
hosts/www2/etc/my.cnf -> ../../../../templates/docs/etc/my.cnf
hosts/www2/usr/local/lib/php.ini -> ../../../../../../templates/docs/usr/local/lib/php.ini
hosts/www2/etc/rrdtool -> ../../../../templates/monitor/etc/rrdtool
hosts/www2/var/htdocs/cacti -> ../../../../../templates/monitor/var/htdocs/cacti
hosts/www2/etc/postfix -> ../../../../templates/monitor/etc/postfix
...
# note that it also links to 'monitor', not only 'docs'
As you can see, most entries are symlinks rather than plainly shared files. This way we keep all the benefits of simple sharing and gain more precise control.
All of these configuration files can be committed to Subversion, so we also get revision control.
The unfold process in the previous graph copies the symlinks out as real files. Its core concept is very simple, for example:
sh$ ct_unfold hosts/www2
# will actually wrap this command:
# /bin/cp -Lrf hosts/www2/* /mnt/realfiles/www2/
# and:
sh$ mount
...
//www2/homes on /mnt/realfiles/www2 type cifs (rw,mand)
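The idea behind the wrapped cp -L can be sketched in Python. This is only an illustration, not the code of ct_unfold: the function name unfold() and the demo layout are assumptions, and shutil.copytree needs Python 3.8+ for dirs_exist_ok.

```python
import os
import shutil
import tempfile


def unfold(src_dir, dest_dir):
    """Copy src_dir into dest_dir, replacing every symlink by the file or
    directory it points to (roughly what 'cp -Lr' does)."""
    # symlinks=False makes copytree follow links and copy their targets;
    # dirs_exist_ok=True allows writing into an existing mount point.
    shutil.copytree(src_dir, dest_dir, symlinks=False, dirs_exist_ok=True)


def demo():
    """Hypothetical layout: one template file, one host that only links to it."""
    root = tempfile.mkdtemp()
    base_etc = os.path.join(root, "templates", "base", "etc")
    host_etc = os.path.join(root, "hosts", "www2", "etc")
    os.makedirs(base_etc)
    os.makedirs(host_etc)
    with open(os.path.join(base_etc, "ld.so.conf"), "w") as f:
        f.write("/usr/local/lib\n")
    # The host directory holds a relative symlink into the template.
    os.symlink(
        os.path.join("..", "..", "..", "templates", "base", "etc", "ld.so.conf"),
        os.path.join(host_etc, "ld.so.conf"),
    )
    real = os.path.join(root, "realfiles", "www2")
    os.makedirs(real)
    unfold(os.path.join(root, "hosts", "www2"), real)
    return os.path.join(real, "etc", "ld.so.conf")
```

Unlike the shell glob hosts/www2/*, this sketch also copies dot files such as .inherits; a real tool would probably skip them.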
I think it is better to let the unfold process also do the actual file transport between the configuration center and the real host, either by pushing or by pulling. The previous example uses the push method.
The implement process, in turn, copies the real files to the right local locations on the corresponding real host.
According to the previous graph and the unfold principles, we can say the design is totally orthogonal: you can modify and commit the configuration files without regard for the effect on the real host; the modifications take effect only when you perform the 'unfold'; and you know who made each modification. If Subversion or the shared storage is down, the real host can still run normally.
Of course, if this were all that is offered, there would be no manageability! Maintaining all the symlinks manually is nothing less than a nightmare, so we need some utilities. Let me explain:
First, we have a base system template, templates/base, and we need a
tool to 'inherit' from it:
sh$ ct_inherit -t templates/base templates/httpd
It should create the right symlink.
As you can see, OO ideas are applied here. We can treat all the subdirs of 'templates' as class definitions (superclass or subclass templates), and all the subdirs of 'hosts' as instances.
Then we need a tool to 'customize' some settings of 'httpd', which makes it an httpd server rather than a base system:
sh$ ct_custom vi templates/httpd/etc/logrotate.d/httpd
It will remove the symlink 'templates/httpd' -> 'templates/base' that was created before, create a real dir 'templates/httpd', and create symlinks 'templates/httpd/*' -> 'templates/base/*' except for the 'templates/httpd/etc' subdir. Then it does 'mkdir templates/httpd/etc' and links 'templates/httpd/etc/*' -> 'templates/base/etc/*' except for 'templates/httpd/etc/logrotate.d', ... At last it adds a new file 'templates/httpd/etc/logrotate.d/httpd' and opens it for editing.
It should also write a file 'templates/httpd/.inherits' which records that the template inherits from 'templates/base'.
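The un-sharing step described above can be sketched as follows. This is a minimal illustration of the idea, not the ct_custom implementation: materialize(), customize() and the demo layout are hypothetical names, and the content written to .inherits is an assumed format.

```python
import os
import tempfile


def materialize(child, parent):
    """Make 'child' a real directory whose entries are relative symlinks
    to the entries of 'parent' (entries that already exist are kept)."""
    if os.path.islink(child):
        os.remove(child)                    # drop the whole-directory symlink
    os.makedirs(child, exist_ok=True)
    for name in os.listdir(parent):
        link = os.path.join(child, name)
        if name != ".inherits" and not os.path.lexists(link):
            os.symlink(os.path.relpath(os.path.join(parent, name), child), link)


def customize(child_root, parent_root, relpath):
    """Create a private real file 'relpath' in the child template while
    everything else stays linked to the parent. Returns the file's path."""
    materialize(child_root, parent_root)
    with open(os.path.join(child_root, ".inherits"), "w") as f:
        f.write(parent_root + "\n")         # assumed format: one parent per line
    child, parent = child_root, parent_root
    parts = relpath.split("/")
    for name in parts[:-1]:                 # un-share one directory level at a time
        child, parent = os.path.join(child, name), os.path.join(parent, name)
        if os.path.isdir(parent):
            materialize(child, parent)
        else:
            os.makedirs(child, exist_ok=True)
    target = os.path.join(child, parts[-1])
    if os.path.lexists(target):
        raise FileExistsError(target)       # an inherited file: needs a decision
    open(target, "w").close()
    return target


def demo():
    root = tempfile.mkdtemp()
    base = os.path.join(root, "templates", "base")
    httpd = os.path.join(root, "templates", "httpd")
    os.makedirs(os.path.join(base, "etc", "logrotate.d"))
    for rel in ("etc/ld.so.conf", "etc/logrotate.d/syslog"):
        open(os.path.join(base, rel), "w").close()
    os.symlink("base", httpd)               # what ct_inherit would have created
    return base, httpd, customize(httpd, base, "etc/logrotate.d/httpd")
```

After demo() runs, templates/httpd/etc/ld.so.conf is a symlink to ../../base/etc/ld.so.conf, as in the listing earlier, while etc/logrotate.d/httpd exists only in the child template.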
If the target file exists, it should report that and ask whether to continue:
sh$ ct_custom vi templates/httpd/etc/hosts.deny
This target inherits from 'templates/base'
do you want to cut off the relationship, or track upwards?(no/yes/track)
If 'yes', it should replace the symlink with a real copy of the file and open it for editing.
If 'track' is chosen, it will follow the symlink to its superclass template and do this recursively (so it asks again whether to cut off or track upwards if that is a symlink too).
Of course there should be a '--force' option to skip the prompt:
sh$ ct_custom --force vi templates/httpd/etc/hosts.deny
or
sh$ ct_custom --track-end=base vi templates/httpd/etc/hosts.deny
Some other operations should be possible too, such as:
sh$ ct_custom cp -r /tmp/httpd/* templates/httpd/etc/httpd/
The last argument should always be the target!
In the previous structure examples, the host instance 'www2' does not only have symlinks to 'templates/docs', but also to 'templates/monitor'. So we can say it inherits from both 'templates/docs' and 'templates/monitor' (multiple inheritance):
sh$ ct_inherit -t templates/docs -t templates/monitor hosts/www2
In this case, 'hosts/www2' should look for the files to symlink in 'templates/docs' first, and then in 'templates/monitor'. The overlapping files should be recorded in 'hosts/www2/.inherits' and follow the first template. If there is a file conflict, that is, the 'same file' in the 2 templates has different contents, it should prompt the user to solve the problem manually.
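The resolution order can be sketched with plain dictionaries standing in for the template directories. The function resolve_inheritance() and its return values are hypothetical; a real tool would compare file contents on disk.

```python
def resolve_inheritance(parents):
    """parents: ordered list of (template_name, {relative_path: content}).

    Returns (links, overlaps, conflicts):
      links     - {relative_path: template that provides the file}
      overlaps  - same path, same content: the first template wins
      conflicts - same path, different content: left to the administrator
    """
    links, contents = {}, {}
    overlaps, conflicts = [], []
    for name, files in parents:
        for path, content in sorted(files.items()):
            if path not in links:
                links[path] = name
                contents[path] = content
            elif contents[path] == content:
                overlaps.append((path, links[path], name))
            else:
                conflicts.append((path, links[path], name))
    return links, overlaps, conflicts


# Hypothetical contents, only to show the three outcomes.
docs = {"etc/my.cnf": "shared", "etc/httpd/conf.d/svn.conf": "svn",
        "etc/hosts.deny": "ALL: ALL"}
monitor = {"etc/my.cnf": "shared", "etc/rrdtool": "rrd",
           "etc/hosts.deny": "ALL: 10.0.0.0/8"}
links, overlaps, conflicts = resolve_inheritance(
    [("templates/docs", docs), ("templates/monitor", monitor)])
```

Here etc/my.cnf is an overlap (identical content, so 'templates/docs' wins silently) while etc/hosts.deny is a conflict that would trigger the prompt.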
This way, 2 templates can be merged into 1. But you may want to merge 2 into 1 without inheritance (by copying the contents); maybe I will add a program ct_merge to do so, and let ct_inherit make use of it.
If a superclass template is modified, the administrator who makes the modification should be told about all the subclass templates and instances that inherit from it.
sh$ ct_custom vi templates/LAMP/etc/httpd/conf/httpd.conf
...
These templates and hosts will be impacted by this modification:
1. templates/docs
2. templates/monitor
3. hosts/www2
do you want to continue?(yes/no/exclude) exclude 3
(exclude hosts/www2)
Of course there should be a '--force' option.
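Finding the impacted templates and hosts amounts to walking the inheritance graph downwards. A minimal sketch, assuming the parent lists have already been read from the .inherits files into a dictionary (impacted_by() is a hypothetical helper, not part of ctemplates):

```python
def impacted_by(template, inherits):
    """Return every template/host that inherits, directly or indirectly,
    from 'template'. 'inherits' maps each name to the list of its parents."""
    # Invert the parent lists into a parent -> children index.
    children = {}
    for child, parents in inherits.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    # Breadth-first walk, so direct heirs are reported before indirect ones.
    found, seen, queue = [], {template}, [template]
    while queue:
        current = queue.pop(0)
        for child in sorted(children.get(current, [])):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


inherits = {
    "templates/httpd": ["templates/base"],
    "templates/LAMP": ["templates/httpd"],
    "templates/docs": ["templates/LAMP"],
    "templates/monitor": ["templates/LAMP"],
    "hosts/www2": ["templates/docs", "templates/monitor"],
}
impacted = impacted_by("templates/LAMP", inherits)
```

With the relationships from the example listings, the result is the same three entries as in the prompt above; 'hosts/www2' appears only once although it inherits through two paths.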
inheritance tree?
Currently there is only a simple README file in this subproject to describe it; you can also find hints in the LFS hints:
[http://www.linuxfromscratch.org/hints/downloads/files/crablfs.txt]
The basic principles can also be found in the LFS hints:
[http://www.linuxfromscratch.org/hints/downloads/files/more_control_and_pkg_man.txt]
A package manager has many functions; here I only talk about how to add reusability to it and how to make use of that reusability. It will also use the configuration syntax control features to reuse the installation profiles, and the configuration sharing mechanism to reuse the installation templates.
Current problems:
Backup is always considered one of the most important things in system administration, and there seem to be many solutions, both commercial and open source.
Because of my limited experience, I have not used any commercial backup system so far. All I know is that they are not cheap, and from my experience with some commercial software I don't think they would fit my needs; I also guess that they may bind you to specific hardware.
Let's have a look at file system backup; database backup is different and is not my concern here:
Maybe you will write some scripts that tar the dirs/files from cron,
but the archives soon become unmanageable, because it is not easy to
classify what has been added to the archives, and there may be many
archives for different parts of the file system.
The lack of incremental/differential backup is also a weakness of
"simple tar". And because tar can never find what has been deleted, the
recovery will be filled with outdated scraps. You can make use of
find to solve this problem in an inefficient way, BUT:
Most important: you will also be stuck with a real production server's file system that holds more than a million dirs/files; unpacking the archives will be very, very slow ... In fact, just traversing this type of file system once can consume too much time. So the efficiency and convenience of recovery are not taken into consideration.
So with today's fast, large and cheap disks, synchronization
mechanisms such as rsync, amanda ... are applied for backup and
recovery. And with the hard link ability of Linux, you can get a
complete full backup at every sync point while actually storing only
the incremental modifications (call it snapshot-like incremental
backup, or rotation ...).
There is a Ruby application named pdumpfs that makes use of both
rsync and rotation (pdumpfs-rsync); I used it under FreeBSD for a
period.
With such tools you can recover from a crash quickly, with data that is perhaps a few hours old at worst. For example, if you back up a host like this:
This graph should be displayed with monospaced fonts:
+----------+ +----------+
| worker | --- rsync ---> | rescue |
+----------+ +----------+
You can replace the "worker" host with the "rescue" host quickly: just make several symlinks on "rescue" that point to the right directories.
A more likely topology is the following one, because it is more cost-efficient; the previous one requires a hot standby pair and is thus more expensive:
This graph should be displayed with monospaced fonts:
+----------+
| worker | --- rsync -----------\
+----------+ |
...... |
|
+----------+ |
| worker | --- rsync -----------\
+----------+ |
V
+----------+ +----------+
| worker | --- rsync ---> | backup |
+----------+ +----------+
| |
[take_over] |
| |
V |
+----------+ |
| rescue | <------------------ NFS
+----------+
This way you get a buffer, and thus more time to recover the crashed worker host. This model is clearly low cost, since you can back up and rescue (HA) 3 or more hosts with only 2 extra hosts (many to one).
NFS is used because recovering directly by copying the files back takes too much time to be practical. For example, a real production online system with SCSI disks, a 100 Mbit NIC and wiring, and 50 GB of data in nearly 2 million dirs/files (about 100 thousand dirs) takes more than 2 hours to copy the files back (you can set up SNMP/Cacti to collect the statistics and compute the time).
Although the SCSI interface is reported as
target0:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 62)
(dmesg), that 160 MB/s (megabytes per second) is only the negotiated
bus transfer rate. About 3 MBytes/s is the upper limit you can actually
expect for the sum of disk reads/writes when retrieving so many small
dirs/files, and about 30 Mbits/s for the network traffic, so disk I/O
is the bottleneck.
If you transport a very big file, such as one of several GBytes, the result is very different: more than 11 MBytes/s for disk I/O and 98 Mbits/s for network traffic. The network becomes the bottleneck, and the performance of disk and network is fully used.
We must concern ourselves with the former situation, because that is the real condition we face. To compute the time that will be consumed:
sh$ echo "50 * 1024 / 3 / 60 / 60" | bc -l
4.74074074074074074074
That means you need nearly 5 hours to recover from the backup! Even assuming a disk I/O rate of 6 MBytes/s:
sh$ echo "50 * 1024 / 6 / 60 / 60" | bc -l
2.37037037037037037037
more than 2 hours are needed! On the whole this is unacceptable!
So the key is: ELIMINATE or REDUCE copy and traverse operations during rescue as far as possible.
The difference between small and big files is rooted in the basic structure and principles of the file system itself. Take Ext3fs as an example: the physical disk is split into many groups, and the metadata of dirs/files is scattered over different parts of the disk. So to transport many small files, the OS kernel has to fetch the metadata of each of them and spends too much time on disk seeking, while a big file has its data in a few sequential areas on disk.
Maybe you can make use of RAID. I think RAID10 can give at most a 2 times speedup (you must also take the writes on the target side into consideration), so more than an hour is still needed. And this is only the copy time: after the copy you may need to change the ownership and permissions of the files, which requires a traversal and is time consuming, and several related operations may need to be done too ...
I'm not familiar with Reiserfs, but I don't expect it to be 5 times faster than Ext3fs. Even with the aid of RAID I can only expect a 6 times speedup, and with the other related operations I think at least 1 hour is needed.
But even with the rescue method mentioned above, you still lose the data written since the last rsync run, because:
The rsync mechanism is not perfect (of course nothing in this world is), because rsync is not "realtime". Every time rsync is invoked, it must build the list of files to transport; to build that list it has to traverse the whole source part of the file system, compare the timestamp of every source file with that of its target, and then perform the corresponding operation. As the file system grows larger and larger, every synchronization consumes more time and more system overhead, so you have to lengthen the interval, although your original intention was to minimize it, since a larger file system and a longer interval mean more lost files/data. For a critical service this loss may not be acceptable.
rsync also has a "vanished" problem: between building the list and doing the actual transport, the file system may change. Some new dirs/files may be created or modified, and some dirs/files may be deleted. The former is not critical, but the latter may cause problems. For example, pdumpfs-rsync terminates when the "vanished" problem occurs and leaves the rest of the synchronization unfinished! But I think rsync itself is robust enough, and these warnings can just be ignored.
No matter which mechanism you use, tar with find, rsync, ..., there is still a challenge: usually you just want to back up specified parts of the file system and exclude some dirs/files in those parts, such as cache files generated by the website's PHP scripts. The identities of these parts should be easy to manage, preferably by the computer itself once you have specified them the first time.
So far I have not found an open source solution for realtime file system replication/mirror backup, so I wrote this one and hope it will be useful to you. (In fact, just yesterday (18 September 2007) I heard of some realtime solutions; some are commercial, such as PeerFs, while some are open source, such as DRBD. I am not yet clear about them and only know the concepts, so I will do some experiments to learn them; maybe they are better solutions.)
I must say that this C/S application, which I named mirrord/fs_mirror
and which lives in the cutils package, is not an absolutely realtime
backup tool. It is only "near realtime": under normal conditions
(normal load average and disk I/O, enough network bandwidth), the delay
between the host running the mirrord agent and the host running the
fs_mirror client is less than 1 second. The resources it consumes are
negligible. For the details, jump to the later "Design" sections.
It is also clear that, so far, mirrord/fs_mirror is only suitable for file systems with small files, not for big files that change frequently, such as log files or databases. Other applications can be used for those.
As a summary, let's look at these aspects of backup and recovery:
fs_info.
The design concepts and cases will be discussed in the next section. If you just want to know how to use mirrord/fs_mirror, please skip ahead to the section "Usage Manual".
To achieve mirroring, the application must know what has been modified in the interval (commonly less than 1 second) since the last file transport. As we know, rsync has to traverse the whole relevant part of the file system and compare timestamps, which costs too many resources and is not realtime.
mirrord/fs_mirror avoids the traversal and comparison by making use of
a facility offered by recent Linux kernels (since 2.6.13):
inotify.
inotify can monitor any modification in the designated directories. That is, when you create, delete, move, or change the attributes of dirs/files, or change the contents of regular files in those directories (not recursively), inotify raises an event. The application can catch and enqueue the events, find the name of the file that led to the event through the watch (described below), and perform the corresponding operation. Here mirrord records the action and the related filename and tells fs_mirror by transmitting the records; fs_mirror then performs the related operation, such as creating, deleting, moving, or file transport.
Although inotify offers no recursive ability, the application can recursively add the directories to its watch list. These watches represent the directories that inotify has to monitor; each one is represented by an integer descriptor in the kernel, so it consumes very little memory. And because it is provided by the kernel, it is lightweight.
For more details about inotify, there are some introductions on the Internet:
http://www.linuxjournal.com/article/8478
http://www-128.ibm.com/developerworks/linux/library/l-inotify.html
http://blog.printf.net/articles/2006/03/25/lightweight-filesystem-notifications
Or you can read the inotify sources in the Linux kernel tree:
./include/linux/inotify.h
./Documentation/filesystems/inotify.txt
./fs/inotify.c
There is also a Python package, pyinotify, that mirrord/fs_mirror
currently uses: http://pyinotify.sourceforge.net/
The simplest story comes from one of my colleagues: use hard links to record the modifications of the file system, export this directory of hard link records to the remote backup host via NFS, and let a daemon on the backup host periodically move the dirs/files from the exported record fs to local directories, like this:
This graph should be displayed with monospaced fonts:
(events) \
[inotify] ----> [mirrord] worker host \ rescue/backup host
| |(invoke) \
| | \
__original-fs__ == HardLink ==> __records__ == MoveOnNFS ==> [fs_mirror]
\
Although it is simple, this model is problematic. There are several problems that cannot be solved by this "hard link only" model:
This graph should be displayed with monospaced fonts:
file1 <=== hardlink === file2
A |
| V
WRITE_CLOSE (mv)
procedure:
| |
| |<-- COPY_COMPLETED
| |
|<-- WRITE_CLOSE |
| |
| |<-- REMOVE
WRITE_CLOSE is the event raised when a file that was written is closed
(inotify's IN_CLOSE_WRITE); if you have read the inotify documents,
this is not hard to understand. When the event is raised, the inotify
processor performs the corresponding "hardlink" operation from file1 to
file2. Let's assume that file2 already exists because of a previous
WRITE_CLOSE, so this WRITE_CLOSE has to ignore the
"OSError: file exists" exception. When "COPY" runs, only the content of
the previous modification has been transported, and "REMOVE" then
unlinks the hard link "file2" that was "created" a moment ago (maybe
several milliseconds), without synchronizing the content of this
modification, which is therefore missed. If the file does not change
again before the machine crashes, you lose this modification entirely.
A single file is not a serious problem, but there is no guarantee that
only a few files will be affected in this way. And as we know, "COPY"
itself is not atomic, which leads to more holes in the target files.
Therefore a more powerful strategy should be chosen for recording the modifications of the file system. Since mirrord currently uses pyinotify, I suggest you have a look at it.
An instance of a class that inherits from "pyinotify.ProcessEvent" can
be used to perform the operations corresponding to the events. For
example, if there is a method
def process_IN_CLOSE_WRITE(self, event):
the inotify "IN_CLOSE_WRITE" events will be processed by this method in
a loop that you program, for example:
while not self.Terminate:
    ...
    if wmNotifier.check_events(wmTimeout):
        wmNotifier.read_events()
    ...
    wmNotifier.process_events()
    ...
Here wmNotifier is an instance of "pyinotify.Notifier" (please refer to the pyinotify documentation if this is unclear). When wmNotifier.process_events() is called, it invokes the process_* methods of the "ProcessEvent" subclass instance: "IN_DELETE" events correspond to "process_IN_DELETE", and "IN_CLOSE_WRITE" events to "process_IN_CLOSE_WRITE", as mentioned above.
So what has to be done is to code such a class and its process_* methods; we can call the subclass a "Processor".
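The naming convention behind the process_* methods can be illustrated without pyinotify. The classes below are a stand-in written for this text, not pyinotify's code: the real library passes an event object rather than a bare path, and Dispatcher and Processor here are hypothetical names.

```python
class Dispatcher:
    """Stand-in for the dispatch step: look up process_<EVENT_NAME> by name."""

    def dispatch(self, event_name, pathname):
        handler = getattr(self, "process_" + event_name, None)
        if handler is None:
            return False          # no method for this event: ignore it
        handler(pathname)
        return True


class Processor(Dispatcher):
    """Records only the events that matter for mirroring."""

    def __init__(self):
        self.records = []

    def process_IN_CLOSE_WRITE(self, pathname):
        self.records.append(("FWRITE", pathname))

    def process_IN_DELETE(self, pathname):
        self.records.append(("DELETE", pathname))


processor = Processor()
handled = [
    processor.dispatch("IN_CLOSE_WRITE", "/var/www/html/index.php"),
    processor.dispatch("IN_ACCESS", "/var/www/html/index.php"),
    processor.dispatch("IN_DELETE", "/var/www/html.old"),
]
```

Events without a matching method, such as IN_ACCESS here, are simply skipped, which is why a Processor only needs methods for the events it cares about.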
Not all events raised by inotify are important. For file system mirroring, only a few events need to be taken into consideration:
IN_CREATE (with IN_ISDIR, for directories only)
IN_CLOSE_WRITE (only for regular files)
IN_DELETE
IN_MOVED_FROM
IN_MOVED_TO
There is another event, "IN_ATTRIB", that is very useful too, but it can only be supported in the next version because of my immature first design of mirrord.
What should this Processor do, or in other words, what should the process_* methods be? Let's first turn our eyes to another problem: what should be watched.
If you have read the documents of inotify and pyinotify, you know that only the dirs/files watched by inotify raise events, and that it is your program's responsibility to add the dirs/files to the watches. So you must decide what to add to the inotify watches by calling inotify_add_watch() of inotify or WatchManager.add_watch() of pyinotify.
Deciding what should be added to the watches, that is, monitored by inotify, is the manageability problem mentioned before. A first rule to know is: only directories need to be monitored. Any modification of the subdirs/subfiles in those directories makes inotify raise an event, and you can get the monitored dir's path name and the subdir/subfile's name from the pyinotify "Event" instance related to the event. In fact pyinotify only monitors directories.
There is the recursion problem mentioned above: when you add a directory to the inotify watches, only modifications of the first-level subdirs/subfiles make inotify raise events. For example, if "/var/www/html" is monitored, modifications of "/var/www/html/index.php" or "/var/www/html/include/" can be caught, while those of "/var/www/html/include/config.php" or "/var/www/html/include/etc/" cannot.
This means that if you create directories with
mkdir -p /var/www/html/include/etc
and "/var/www/html" is monitored, only "/var/www/html/include" will
make inotify raise the "IN_CREATE|IN_ISDIR" event. The same rule holds
for os.makedirs in Python, and for any recursive copy/move operation!
So it is the program's (mirrord's) responsibility to add all the necessary directories to the watches. This requires traversing some parts of the file system, which is resource consuming, but the traversal only needs to be done once. After this stage, named "boot init", the resources can be released and mirrord runs as a daemon.
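The one-time traversal can be sketched with os.walk. This is an illustration of the idea only; directories_to_watch() is a hypothetical helper, and a real daemon would pass each returned directory to the watch manager instead of collecting a list.

```python
import os
import tempfile


def directories_to_watch(top):
    """Return 'top' and every directory below it, parents before children.
    Every one of them needs its own watch, because a watch is not recursive."""
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()           # deterministic order; os.walk honours this
        found.append(dirpath)
    return found


def demo():
    root = tempfile.mkdtemp()
    html = os.path.join(root, "html")
    # Same effect as 'mkdir -p html/include/etc': three levels at once.
    os.makedirs(os.path.join(html, "include", "etc"))
    open(os.path.join(html, "index.php"), "w").close()
    return html, directories_to_watch(html)
```

Regular files such as index.php do not show up in the result, in line with the rule that only directories need to be monitored.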
Although pyinotify.WatchManager has a "rec" argument for its add_watch(), I am not prepared to use it, because:
Here a module, fs_info.py, is used for including/excluding. It is a
file system identity system: you can use it to split your file system
into several parts with an include/exclude configuration, and use the
methods it provides to find the list of all dirs/files that fit the
configuration.
For example, you can add some dirs/files to a file system identity this way:
sh# fs_info -a t:/var/named \
            -a t:/var/www/html \
            -a x:/var/www/html/cache \
            -a x:/var/www/html/log website
Here fs_info is a command line interface that invokes the corresponding
methods of the module fs_info.py depending on the options. "-a" means
"append", and t/x have a meaning similar to "-T/-X" of tar, that is,
"include/exclude".
The identity is "website". This means that the file system identity
"website" contains several directories and their subdirs/subfiles. When
you invoke mirrord, you can pass this identity as an argument to the
Mirrord constructor of the mirrord.py module; the instance can then use
FsInfo.find() to get the whole list of subdirs/subfiles of the dirs
contained in this identity, without the excluded ones.
fs_info.FsInfo will build a directory "/var/fs_info/website" and record
the included (t) dirs/files in /var/fs_info/website/.t_files and the
excluded (x) dirs/files in /var/fs_info/website/.x_files. You can
adjust the location by modifying the configuration of fs_info in
/etc/python/fs_info_config.py, for example:
options.datadir = "/var/mirrord"
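The include/exclude decision can be sketched as a pure function over path names. This is not the code of fs_info.py: is_selected() is a hypothetical name, and the rule that an exclude always overrides an include is an assumption made for the sketch.

```python
def is_selected(path, includes, excludes):
    """True if 'path' lies under an included (t) top and under no
    excluded (x) top. Assumption: excludes always win over includes."""

    def under(candidate, top):
        top = top.rstrip("/")
        # Compare whole path components, so '/var/www/html/logo.png'
        # is not mistaken for something below '/var/www/html/log'.
        return candidate == top or candidate.startswith(top + "/")

    if any(under(path, x) for x in excludes):
        return False
    return any(under(path, t) for t in includes)


# The "website" identity from the fs_info example above.
includes = ["/var/named", "/var/www/html"]
excludes = ["/var/www/html/cache", "/var/www/html/log"]
selected = [p for p in [
    "/var/named",
    "/var/www/html/index.php",
    "/var/www/html/cache/page.tmp",
    "/var/www/html/logo.png",
    "/etc/passwd",
] if is_selected(p, includes, excludes)]
```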
So at last we get this view:
This graph should be displayed with monospaced fonts:
(1)---> [mirrord] ---(2)---> [fs_mirror] ---(3)
/ \
[fs_info] /==============(4)===============> [fs_sync]
/ / \
+--------------+ +---------------+
| original | | mirrored |
| file system | | file system |
+--------------+ +---------------+
(1) tell it what to monitor for mirror
(2) tell it what are modified
(3) tell it what to transport
(4) the actual regular files been transporting
By (2), fs_mirror knows what has changed and calls fs_sync to perform the corresponding operations. Creating directories and deleting or moving dirs/files can be performed on the client side itself, so these consume no resources on the server. Content modifications of regular files, however, cannot be mirrored unless the files are copied from the server. This copy can be performed with any network file transport protocol or tool, such as FTP, NFS, SMB/CIFS, SSH or rsync ..., so there can be many implementations of fs_sync.
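The split between local operations and file transport can be sketched like this. It is an illustration under assumptions, not the fs_sync code: records are taken to be (action, path) tuples with the actions CREATE, FWRITE, DELETE and MOVE, and the 'fetch' callback stands in for whatever transport is chosen.

```python
import os
import shutil
import tempfile


def apply_record(record, mirror_root, fetch):
    """Replay one modification record under mirror_root."""
    action, arg = record

    def local(path):
        return os.path.join(mirror_root, path.lstrip("/"))

    if action == "CREATE":                  # directories: purely local
        os.makedirs(local(arg), exist_ok=True)
    elif action == "DELETE":                # purely local
        target = local(arg)
        if os.path.isdir(target) and not os.path.islink(target):
            shutil.rmtree(target)
        elif os.path.lexists(target):
            os.remove(target)
    elif action == "MOVE":                  # purely local
        source, destination = arg
        os.rename(local(source), local(destination))
    elif action == "FWRITE":                # content must come from the server
        fetch(arg, local(arg))
    else:
        raise ValueError("unknown action: %r" % (action,))


def demo():
    mirror_root = tempfile.mkdtemp()
    fetched = []

    def fake_fetch(remote_path, local_path):
        # Placeholder for FTP, NFS, rsync ...: just writes a marker file.
        fetched.append(remote_path)
        with open(local_path, "w") as f:
            f.write("content of " + remote_path)

    for record in [
        ("CREATE", "/var/www/html"),
        ("FWRITE", "/var/www/html/index.php"),
        ("CREATE", "/var/www/html/cache"),
        ("DELETE", "/var/www/html/cache"),
        ("MOVE", ("/var/www/html", "/var/www/html.new")),
    ]:
        apply_record(record, mirror_root, fake_fetch)
    return mirror_root, fetched
```

Only the FWRITE record touches the transport callback; the other four records are replayed without contacting the server.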
Some further explanation of this sync mechanism can be found at the end of The Fifth Story: Monitor.
Having explained "fs_info" and (1), let's come back to the previous question: what should the "Processor" (to avoid confusion, let's call it the "Monitor") do? This is part of the function of "mirrord" and of (2) in the view above. For more details about fs_info, please read the code and its internal documentation.
From the previous picture and the first story, we know that mirrord has to:
In the first story we saw one mechanism for recording modifications, which proved not to be a good idea.
After that, the first idea that came to my mind was to set up four arrays: CREATE (IN_CREATE|IN_ISDIR, only for directories), FWRITE (only for writes to regular files), DELETE (IN_DELETE) and MOVE (IN_MOVED_FROM and IN_MOVED_TO), and to append every path name related to a modification to the corresponding array.
My first design managed such arrays per server thread: every time a server thread is created, it builds these arrays for itself, because the status of every client differs.
Managing these per-thread groups of arrays forces every thread to implement an inotify "Processor", or say a "Counter", to append modifications to the arrays and to clear the arrays once the client has read the records.
This design is also problematic:
If we take MySQL replication as a reference, the right solution emerges. What should be done is to set up a log that records all types of modifications with the corresponding path names. Let's name it "wmLog" (Watch Modify Log).
This wmLog is shared by the main thread and the server threads. The "Monitor" in the main thread appends new modifications to the tail of wmLog, and each server thread reads the wmLog with a pointer that indicates its read position, so a server thread knows what it has already processed.
What should this wmLog look like? Since order is very important, a list (sequence) is the most direct choice, but a hash table is more flexible and clear. If you need to modify the wmLog, for example to delete records older than about 1 month so that the wmLog does not grow too large, a hash table is quite useful; and with hash-table-like APIs it is easier to switch to other logging mechanisms (for example Berkeley DB, discussed below).
To keep the hashed wmLog ordered, its key is a serial number. Every time an inotify event is raised and processed, the serial number is increased by 1, so the wmLog looks like this Python dictionary:
{
1 : ('CREATE', '/var/www/html'),
2 : ('CREATE', '/var/www/html/include'),
3 : ('FWRITE', '/var/www/html/index.php'),
4 : ('DELETE', '/var/www/html.old'),
5 : ('MOVE', ('/var/www/html', '/var/www/html.new')),
...
}
So the pointer owned by every server thread is just an integer. It can differ from the main thread's serial number and from the other threads' pointers, which reflects the different reading progress of different clients.
This mechanism makes it possible to restart client synchronization from the point where it broke off. For example, when the main thread has increased the serial number to 10234, a server thread may have read only up to position 9867; the client may then be terminated for some reason while the main thread's serial number keeps increasing. The next time the client starts, it can tell the server thread to transmit the wmLog from point 9867. Otherwise the client would have to ask for a whole new sync init, which forces the server thread to resend the whole content of the file system snapshot. This resending is resource consuming (since you can always use Python generators it does not need much memory, but it still costs CPU and network traffic), and the following transport of regular files is even more resource consuming.
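A minimal in-memory sketch of such a log with per-client pointers follows. It illustrates the mechanism only; the class name SerialLog and its methods are assumptions and not the API of fs_wmlog.

```python
class SerialLog:
    """Dictionary keyed by an increasing serial number, read via pointers."""

    def __init__(self):
        self.records = {}
        self.serial = 0           # serial number of the newest record

    def append(self, action, path):
        """Called by the Monitor in the main thread for every event."""
        self.serial += 1
        self.records[self.serial] = (action, path)
        return self.serial

    def read_from(self, pointer):
        """Yield (serial, record) for everything newer than 'pointer'.
        A generator, so a long backlog is never built up in memory."""
        for serial in range(pointer + 1, self.serial + 1):
            yield serial, self.records[serial]


log = SerialLog()
log.append("CREATE", "/var/www/html")
log.append("CREATE", "/var/www/html/include")
log.append("FWRITE", "/var/www/html/index.php")

# A client reads two records, then disconnects with its pointer at 2.
pointer = 0
reader = log.read_from(pointer)
for _ in range(2):
    pointer, _record = next(reader)

# Meanwhile the main thread keeps logging.
log.append("DELETE", "/var/www/html.old")

# On reconnect the client resumes after its last pointer.
resumed = list(log.read_from(pointer))
```

The client only has to remember one integer to resume; nothing already processed is sent again.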
But unlike MySQL replication, so far there is no implementation that restarts mirrord itself from the point where it stopped. Between 2 boots of mirrord the file system may change without being monitored, so the simple way is to rebuild the whole snapshot from the beginning and reset wmLog to 0. This also forces the client to redo the sync init, because it is necessary to guarantee the consistency of the original and the mirrored file system. mirrord achieves this by requiring the client to present to the server thread an MD5 session id that is generated when mirrord starts.
A future version will improve this boot strategy by comparing the "current" file system at boot with the snapshot from the last termination, to compute the modifications made between the 2 boots; this can make the boot init smoother.
wmLog is a hash table, so Python's dictionary is the most direct choice, but it lives in memory while the log grows larger and larger. If we assume that the average length of 1 record is 30 bytes and the modification rate is 1/s, it consumes "30 * 86400 / 1024 / 1024 = 2.47 MBytes" of memory per day; and since you can control the length of the wmLog, you can control the upper memory limit. If the original file system is not so large, or is not modified so frequently, it is a good idea to keep the records in this in-memory wmLog.
Otherwise it is sensible to use a hash-like database such as Berkeley DB. Only a simple class, fs_wmlog.WmLog, is created to wrap the operations of Berkeley DB: wmLog keys are integers while BDB only accepts strings, and the wrapper keeps the interface consistent.
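The key conversion such a wrapper has to do can be sketched as follows. This is not fs_wmlog.WmLog: the class name StringKeyedLog is hypothetical, the repr()/literal_eval encoding of records is an assumption, and a plain dict stands in for the Berkeley DB handle.

```python
import ast


class StringKeyedLog:
    """Integer-keyed view on a store that only accepts strings."""

    def __init__(self, store):
        self.store = store        # any mapping with string keys and values

    def __setitem__(self, serial, record):
        self.store[str(serial)] = repr(record)

    def __getitem__(self, serial):
        raw = self.store[str(serial)]
        if isinstance(raw, bytes):          # some DB modules return bytes
            raw = raw.decode()
        return ast.literal_eval(raw)        # safe parsing of tuples/strings

    def __contains__(self, serial):
        return str(serial) in self.store


store = {}                                  # stand-in for a Berkeley DB file
wmlog = StringKeyedLog(store)
wmlog[1] = ("CREATE", "/var/www/html")
wmlog[5] = ("MOVE", ("/var/www/html", "/var/www/html.new"))
```

The callers keep using integer serial numbers, so switching between an in-memory dict and a database does not change the calling code.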
At first I used the in-memory dict, but the production systems I must deal with contain very large file systems with millions of dirs/files. For simplicity this version uses only BDB, but I will implement both in the future.
My first implementation combined both of them behind one API, that is, wmLog contained both a dict and a Berkeley DB, like this:
import bsddb

class WmLog:
    def __init__(self, path):
        self.memlog = {}
        self.dbdlog = bsddb.btopen(path, 'n')
The most recent records are in memlog, and the earliest part of memlog is moved into Berkeley DB when memlog grows too large. Then I found that this is not necessary, because it is more convenient to rely on BDB's own memory management.
So far only the actions with the related pathnames are logged and transmitted, but the files' status is also useful; furthermore, with a hostid it would be possible to make 2 or more hosts mirror each other. These are missing because of the immaturity of this first design and will be added in the future.
A file system snapshot is a data structure that stores all the path names with their related status, such as file type (directory or regular file), permissions and ownership, size and last modification time. In this version only the pathnames and the file type are contained in the snapshot; a future version will improve it.
There are several reasons why a file system snapshot is necessary:
There are several ways to implement the file system snapshot. The simplest is an in-memory hash table (dict) whose keys are the pathnames and whose values are the related status of the files (so far only 0/1 to represent the file type). But it consumes too much memory, and it cannot easily give the names of subdirs/subfiles except by traversing all the keys and comparing them (to the top pathname that raised the IN_MOVED_FROM inotify event). Let's call this implementation "PureDict".
Another implementation is to create a tree-like data structure to store
the pathnames. I have defined a basic Tree class, which can be found in
the caxes subproject of ulfs, but here I implemented a different one,
which I may add to the caxes project.
With this data structure, the internal storage of a record looks like this:
>>> fssnap['var']['www']['html'] = 1
>>> fssnap['var']['www']['html']['index.php'] = 0
But it should still have hash-like APIs, to make the operations more convenient and intuitive and to be consistent with the other implementations of the file system snapshot, so at least these operations should be valid:
>>> fssnap['/var/www/html'] = 1
>>> fssnap['/var/www/html/index.php'] = 0
>>> x = fssnap.pop('/var/www/html')
>>> for fname in fssnap: print fname
>>> for fname in fssnap.keys(): print fname
# The generator should be used more widely than the keys() method,
# because keys() can consume too much memory.
This implementation saves some memory, and the "retentivity of vision" functionality can be achieved easily with this data structure. Let's name it "SnapTree".
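To make the idea concrete, here is a minimal sketch of a tree that is stored as nested nodes but addressed with whole pathnames. It is not the code in fs_snap.py: the node layout (`[value, children]`) and the helper names `_parts` and `_walk` are assumptions made for illustration, and it is written for a current Python rather than the Python 2.4 of this project.

```python
class SnapTree(object):
    """Nested-node snapshot addressed by absolute path strings (a sketch)."""

    def __init__(self):
        # Each node is [value, children]; value stays None for nodes that
        # exist only as parents of other entries.
        self._root = [None, {}]

    @staticmethod
    def _parts(path):
        return [p for p in path.split('/') if p]

    def __setitem__(self, path, value):
        node = self._root
        for part in self._parts(path):
            node = node[1].setdefault(part, [None, {}])
        node[0] = value

    def __getitem__(self, path):
        node = self._root
        for part in self._parts(path):
            node = node[1][part]          # KeyError if a component is missing
        if node[0] is None:
            raise KeyError(path)
        return node[0]

    def _walk(self, node, prefix):
        # Depth-first generator: a directory is yielded before its children.
        if node[0] is not None:
            yield prefix, node[0]
        for name in sorted(node[1]):
            for item in self._walk(node[1][name], prefix + '/' + name):
                yield item

    def __iter__(self):
        for path, _value in self._walk(self._root, ''):
            yield path

    def pop(self, path):
        """Detach path with everything below it; return it as a flat dict."""
        parts = self._parts(path)
        parent = self._root
        for part in parts[:-1]:
            parent = parent[1][part]
        node = parent[1].pop(parts[-1])
        return dict(self._walk(node, '/' + '/'.join(parts)))
```

Because each path component is stored only once, common prefixes such as /var/www/html are shared by all entries below them, and pop() on a directory returns its subdirs/subfiles without scanning unrelated keys.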
The third way is to build a real file system that contains only the dirs and empty regular files, with the same pathnames as the original file system. This snapshot can be on Ext3fs, on Reiserfs for better performance (since the regular files on the snapshot file system are empty, I guess Reiserfs can perform better), or even on Tmpfs. Let's name it "Snapfs".
All these three implementations (and the remaining implementations that
have not been described so far) can be found in the module fs_snap.py,
and here is a comparison of the boot init stage using each of them
separately:
All files number: 2939516
Dirs number: 104609
Empty dirs number: 51671
Record file size: 206MB (find /absolute/path/to/the/part >/tmp/record.txt)
| - | PureDict | Snapfs | SnapTree1 | SnapTree2 |
|---|---|---|---|---|
| Mem consumed | 388MB(19.2%) | 417MB* | 261MB(12.9%) | 254MB(12.6%) |
| Boot time | 9min | 20min | ~10min | 10min |
| CPU | 20% | 25% | 15% | 15% |
| Load average | 1(4 CPUs) | 1.5 | 1 | 1 |
| Disk I/O | R:2MB/s | R:1.5MB/s | R: 1.3MB/s | R:<1.3MB/s |
| - | W:2MB/s | W: >2MB/s | W:<0.3MB/s | W:<1.3MB/s |
* Snapfs consumes basically no memory, only disk space. This result only reflects the situation on Ext3fs; Reiserfs has not been tested. Maybe it is also possible to make use of RAID (RAID0) to improve the performance.
Tmpfs seems a good idea, but it is impractical when you look into it more deeply, because every inode has to occupy a whole page (4096 bytes). Although it is possible to limit the inodes, that is not convenient and does not solve the problem at its root. Maybe it is possible to put most of the snapshot on swap, but I have not found an effective resource limit method (setrlimit in Python and limits.conf of PAM do not behave as I expected).
You can use the command df -i to check the number of the inodes that
have been consumed.
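As a rough check of why Tmpfs is impractical here, the arithmetic can be written down directly. This takes the one-page-per-inode statement above at face value and assumes a page size of 4096 bytes; the file count is the one from the first test file system.

```python
PAGE_SIZE = 4096        # bytes per page (assumed)
inodes = 2939516        # "All files number" of the first test file system

# One page per inode, ignoring every other overhead.
needed = inodes * PAGE_SIZE
print("%.1f GiB" % (needed / float(2 ** 30)))   # prints "11.2 GiB"
```

That is far more than the few hundred MB the in-memory implementations need for the same file system.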
If there is an effective resource limit method, I think it could also be applied to PureDict/SnapTree.
Here is a comparison on another file system:
All files number: 5951114, record file size: 416MB
Dirs number: 227534, record file size: 12MB
Empty dirs number: 115314, record file size: 6.3MB
| - | PureDict | Snapfs | SnapTree1 | SnapTree2 |
|---|---|---|---|---|
| Mem consumed | 781MB(38.6%) | -* | 530MB(26.2%) | 514MB(25.4%) |
| Boot time | 19min | - | ~23min | ~23min |
| CPU time | 6:27.77 | - | 10:32.68 | 11:01.55 |
| CPU | 18% | - | 14% | 14% |
| Load average | 1.3(4 CPUs) | - | 1 | 1 |
| Disk I/O | R:1.5MB/s | - | R: 1.3MB/s | R: 1.3MB/s |
| - | W:0.5MB/s | - | W: 0.4MB/s | W:<0.4MB/s |
To avoid the influence of other aspects of mirrord, check the script
mirrord_consumed.py in the "cutils/trunk/prototypes" subdir. It
contains nothing except the snapshot implementations above, so you can
check the memory each mechanism consumes with the top command.
Now there are two types of snapshot: w(atched)dirs and (regular)files. This also comes from the immature first design, and it prevents the "get subdirs/subfiles names when IN_MOVED_FROM" function mentioned above from working as expected. So until now only the first 2 requirements have been achieved; the third (subdirs when IN_MOVED_FROM) will be completed in the next version, after wdirs and files are merged into a single snapshot structure. The next version will then support this:
>>> x = fssnap.pop('/var/www/html')
>>> print x
{'/var/www/html' : 1, '/var/www/html/index.php' : 0}
The snapshot mechanisms described are all in-memory solutions. As for wmLog, Berkeley DB can also be applied to fit the needs of large file systems with millions of small files.
BDB is of course hash-like, so how can it find the subdirs/subfiles to solve the "IN_MOVED_FROM" problem? This can be resolved with BDB's BTree access method. Because a BTree keeps its keys in order, when a directory raises an event we call BDB's set_location() with the prefix key path and then call next() repeatedly until key.startswith(prefix) returns False or DBNotFoundError is thrown. By overriding the pop() method of the hash-like BDB implementation, we get the same effect as above.
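The prefix scan can be tried without Berkeley DB. The sketch below uses a sorted in-memory list and the standard bisect module as a stand-in for the BTree cursor (bisect_left plays the role of set_location(), walking the list plays the role of next()); the class name `SortedSnap` is made up for this example. It also shows one detail worth handling: the scan should start at the prefix with a trailing "/", otherwise a sibling such as "/var/www/html-old" sorts between the directory and its children.

```python
import bisect


class SortedSnap(object):
    """Stand-in for a BTree-backed snapshot: keys are kept in sorted order."""

    def __init__(self):
        self._keys = []      # sorted pathnames
        self._values = {}    # pathname -> status

    def __setitem__(self, path, value):
        if path not in self._values:
            bisect.insort(self._keys, path)
        self._values[path] = value

    def keys(self):
        return list(self._keys)

    def pop(self, path):
        """Remove path and everything below it; return them as a dict."""
        path = path.rstrip('/')
        removed = {}
        if path in self._values:
            removed[path] = self._values.pop(path)
            del self._keys[bisect.bisect_left(self._keys, path)]
        # "Seek" to the first key >= prefix, then read forward while the
        # keys still start with the prefix.
        prefix = path + '/'
        start = bisect.bisect_left(self._keys, prefix)
        end = start
        while end < len(self._keys) and self._keys[end].startswith(prefix):
            removed[self._keys[end]] = self._values.pop(self._keys[end])
            end += 1
        del self._keys[start:end]
        if not removed:
            raise KeyError(path)
        return removed
```

Only the keys under the moved directory are visited, so the cost depends on the size of that subtree and not on the size of the whole snapshot.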
Although BDB puts the data file on disk, it has its own memory management and the performance is acceptable. A further performance improvement I can imagine is to put BDB's data file on a RAID0 array.
There is another requirement: smooth boot. As is known, it is resource consuming to set up the snapshot and do the client's first sync. If the client (fs_mirror) is terminated unexpectedly, it can restart from the broken point with the aid of wmLog by submitting the serial number of the broken point and the corresponding session. But if the server (mirrord) is terminated unexpectedly, it has to rebuild the whole file system snapshot when it reboots, and the client has to redo the whole first sync, because the server has no idea what has been modified between the two boots.
To solve this problem, it is necessary to dig into the potential of the snapshot and wmLog. Since we have the old snapshot from the broken point, it is possible to compare it with the current file system and compute the differences, and turn these differences into normal records in the wmLog. The server can then keep using the old session and serial, and the client can skip the whole resync procedure!
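A minimal sketch of that comparison, assuming the snapshot is a plain {pathname: 0/1} mapping as in PureDict. The function names `scan` and `diff_records` are invented for this example. Note its limit: with only pathnames and file types in the snapshot, a regular file whose content changed between the two boots cannot be detected; that would need the size or modification time, which the snapshot does not store yet.

```python
import os


def scan(top):
    """Return {path: 1 for dirs, 0 for other entries} for top and below."""
    snap = {top: 1}
    for root, dirs, files in os.walk(top):
        for name in dirs:
            snap[os.path.join(root, name)] = 1
        for name in files:
            snap[os.path.join(root, name)] = 0
    return snap


def diff_records(old_snap, new_snap):
    """Turn the difference between two snapshots into wmLog-like records."""
    records = []
    # Deletions first, children before parents (reverse sorted order).
    for path in sorted(old_snap, reverse=True):
        if new_snap.get(path) != old_snap[path]:
            records.append(('DELETE', path))
    # Then creations, parents before children.
    for path in sorted(new_snap):
        if old_snap.get(path) != new_snap[path]:
            action = 'CREATE' if new_snap[path] else 'FWRITE'
            records.append((action, path))
    return records
```

On reboot the server would load the old snapshot, call scan() on the watched tree, and enqueue the result of diff_records() with fresh serial numbers.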
wmLog and the snapshot are updated together when inotify events occur. When and how these operations are invoked is decided by the inotify Monitor, as mentioned in the second story.
Besides the operations described in the second story, there are several other functionalities the "Monitor" should contain:
Manage the watches via the API provided by pyinotify.WatchManager. As discussed above, the wd (watch descriptor) is not removed from the watches when a dir/file is moved, so Monitor has to do this cleanup and update the file system snapshot correspondingly.
On the other hand, the recursion problem makes inotify raise only an "IN_CREATE" event for the top directory when you mkdir/cp/mv recursively, so once Monitor catches an event on a directory, it should walk that directory and add all its subdirs recursively to the watches. Although this is a traversal, under common conditions only a very small part of the file system has to be walked, so it is not resource consuming.
There are "missed" and "vanished" problems when Monitor performs this "walk adding watches" action; let me explain them now.
As we know, mirrord/fs_mirror is not an absolutely realtime solution. It processes the inotify events in loops at intervals, and the modifications that occur during one loop can only be processed in the next loop. The missed problem then arises like this:
This graph should be displayed with monospaced fonts:
                                |
                 modifications  |  operations
                     ------ next loop ------
                                |
                                o<---(listdir(root))
                                |
        (create root/subdir)--->x
   (create root/subdir/dnew)--->x
   (fwrite root/subdir/fnew)--->x
                                |
                                o<---(add root to watch)
                                |
                                o<---(listdir(root/subdir1))
                                o<---(add root/subdir1 to watch)
                                |
                                o<---(listdir(root/subdir2))
                                o<---(add root/subdir2 to watch)
                                |
                               ...
                                |
                     ------ next loop ------
                                |
As shown above, a new dir "root/subdir" is created in the gap between "listdir(root)" and "add root to watch". Since "listdir" has already been done, the application will not walk into this subdir and add it to the watches; and since root has not yet been added to the watches, the modifications in it raise no inotify events. Thus you lose "root/subdir" and its subdirs totally: none of them have been added to the watches, and any further modifications in them will not be reported to mirrord! They are "missed"!
The solution currently applied is to do the listdir() again just after "add root to watch"; the second listing finds what has been missed and walks into it.
But just a moment ago (on 24 September 2007) I thought: why not "add root to watch" before "listdir(root)"? This leads to overlapping additions, that is, the newly created "root/subdir" will be added to the watches twice, once by "walk adding watches" and once by the inotify processing, but this is not a serious problem and can be safely ignored.
But that means I must implement my own walk function rather than use os.walk(), since os.walk() always does listdir() first.
I will try to do this in a future version since it is a hole left by the immature design, but before that I must make sure that the iterator/generator is invoked sequentially rather than in parallel, for example:
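def walk(root):
    if os.path.isdir(root):
        yield root, 1
        for name in os.listdir(root):
            # re-yield the items of the recursive call; calling walk()
            # alone would only create a generator and discard it
            for item in walk(os.path.join(root, name)):
                yield item
    else:
        yield root, 0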
Will this "walk" pause after "yield root, 1" and wait for "root" to be processed (sequential), or will it call "listdir(root)" immediately (parallel)? If the former, this method is usable; otherwise there are problems unless some other solution exists. I need to write a prototype script to find out, unless you already have the answer. If you know the solution, would you please give me some tips? Thank you very much :D
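A small prototype can settle the question. Python generators are lazy: the body runs only up to the next yield and stays paused there until the consumer asks for the next item, so the behaviour is the sequential one. The sketch below records the order of events in a list; the `log` argument and the `demo` function are additions for this experiment only, and appending a 'watch' entry stands in for really adding a watch.

```python
import os
import tempfile


def walk(root, log):
    """Yield (path, is_dir); a directory is yielded before it is listed."""
    if os.path.isdir(root) and not os.path.islink(root):
        yield root, 1
        # Execution only reaches this point when the consumer asks for
        # the next item, i.e. after it has finished processing root.
        log.append(('listdir', root))
        for name in sorted(os.listdir(root)):
            for item in walk(os.path.join(root, name), log):
                yield item
    else:
        yield root, 0


def demo():
    """Walk a scratch tree and return the recorded order of events."""
    top = tempfile.mkdtemp()
    os.mkdir(os.path.join(top, 'sub'))
    log = []
    for path, is_dir in walk(top, log):
        if is_dir:
            log.append(('watch', path))   # stands in for adding the watch
    return [(what, os.path.relpath(path, top)) for what, path in log]
```

demo() returns [('watch', '.'), ('listdir', '.'), ('watch', 'sub'), ('listdir', 'sub')]: every directory is "watched" before it is listed, which is the order needed to close the gap described above.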
The "vanished" problem plays out this way:
This graph should be displayed with monospaced fonts:
                                |
                 modifications  |  operations
                     ------ next loop ------
                                |
                (delete dir)--->x
                (create dir)--->x
                (delete dir)--->x
                                |
                     ------ next loop ------
                                |
                                o<---(process IN_DELETE 1)
                                o<---(process IN_CREATE 2*)
                                |
                                o<---(process IN_DELETE 3*)
                                |
                               ...
The problem occurs at "process IN_CREATE 2*": when this operation is invoked, it needs to call os.walk() and os.listdir(), both of which expect an existing directory, but the directory has been deleted (vanished) by the preceding modifications, so os.listdir() throws OSError.
"process IN_DELETE 3*" may also have problems if the errors in "process IN_CREATE 2*" are simply ignored, because it then has to delete nonexistent items from the snapshot and watches; ignoring the errors again is not exact and leaves holes in the system!
This also applies to "MOVED" events, since "MOVED" has the recursion problem too and thus also calls "walk adding watches". For example:
This graph should be displayed with monospaced fonts:
                                |
                 modifications  |  operations
                     ------ next loop ------
                                |
 (MOVED_TO from unmonitored)--->x
                                |
 (MOVED_FROM to unmonitored)--->x
                                |
                     ------ next loop ------
                                |
                                o<---(process IN_MOVED_TO *)
                                |
                                o<---(process IN_MOVED_FROM *)
                                |
                               ...
Since the dir/file has actually been removed (vanished) before "process IN_MOVED_TO", it is not added to the snapshot and watches, and the following "process IN_MOVED_FROM" then gets stuck by raising KeyError.
The solution to the "vanished" problem can be found in the source code and the related unit tests.
In contrast to the "vanished" problem, there is an "explode" problem for clients. For example:
This graph should be displayed with monospaced fonts:
                                |
                 modifications  |  operations
                     ------ next loop ------
                                |
      (FWRITE /path/to/file)--->x
                                |
      (DELETE /path/to/file)--->x
                                |
  (CREATE DIR /path/to/file)--->x
                                |
                     ------ next loop ------
                                |
                                o<---(process IN_FWRITE_CLOSE *)
                                |
                                o<---(process IN_DELETE)
                                |
                               ...
The first operation "process IN_FWRITE_CLOSE" runs into the problem, because it tells the remote client to sync a regular file, but the target is actually a directory! The client also has the "vanished" problem; the solutions to these client side problems will be discussed later.
A very useful requirement is the ability to add and delete watches dynamically. In the previous sections I described how the boot init stage and the "walk adding watches" procedure use fs_info to include/exclude some dirs/files. But sometimes you may find dirs/files that are too big or change too frequently (or say, the value of size*freq/min is too big). It would then be useful to remove those dirs or files from fs_info.FsInfo and exclude them by invoking a command, such as sending a signal:
sh# mirrord -k exclude "/var/www/html/site.com/log/20070924.log"
Or on the contrary, you may want to add several dirs/files to the watches:
sh# mirrord -k include "/var/www/html/include/new/"
So far this function has not been implemented.
IN_CREATE is only handled for directories because for regular files IN_CLOSE_WRITE is more meaningful, which makes IN_CREATE for files redundant.
IN_CLOSE_WRITE indicates that the content of a regular file has been modified. As you can see in all the descriptions, only then does the client fs_mirror ask for an actual file transport, while the other types of modification can be performed on the client itself without consuming any server resources.
This file transport is carried out via any existing file transport protocol, such as FTP, NFS, SMB/CIFS, etc.
A very important characteristic is that MOVE can be executed locally, rather than being treated as CREATE/FWRITE after DELETE; the latter often requires an actual transmission of file content, which costs more server resources.
But there is no guarantee that a "MOVE" will always be treated as "MOVE"; it may still be treated as "DELETE" followed by "CREATE/FWRITE", because IN_MOVED_FROM and IN_MOVED_TO are two separate events related only by a cookie. Once an IN_MOVED_TO event is detected, the Monitor looks for an IN_MOVED_FROM with the corresponding cookie; if it is found, the pair becomes a "MOVE" action and is enqueued to wmLog. If there is no such pair, IN_MOVED_FROM is treated as "DELETE" and IN_MOVED_TO as "CREATE/FWRITE". Monitor decides that there is no pair after several loops (currently 2), since it cannot wait forever for the IN_MOVED_TO that matches an IN_MOVED_FROM. That means if IN_MOVED_TO arrives 3 inotify process loops after IN_MOVED_FROM, the IN_MOVED_FROM has already been cleaned up and treated as "DELETE", and the IN_MOVED_TO is treated as "CREATE/FWRITE".
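The pairing rule can be sketched as a small bookkeeping class. This is an illustration, not the Monitor code: the class name `MovePairer`, its method names and the exact way loops are counted are assumptions; only the rule itself (pair by cookie, give up after 2 loops) comes from the description above.

```python
class MovePairer(object):
    """Pair IN_MOVED_FROM / IN_MOVED_TO events by cookie (a sketch)."""

    def __init__(self, max_loops=2):
        self.max_loops = max_loops
        self._pending = {}      # cookie -> [source path, age in loops]

    def moved_from(self, cookie, path):
        # Nothing is logged yet; wait for the matching IN_MOVED_TO.
        self._pending[cookie] = [path, 0]
        return []

    def moved_to(self, cookie, path, is_dir):
        if cookie in self._pending:
            src = self._pending.pop(cookie)[0]
            return [('MOVE', src, path)]
        # No partner known: the target is simply something new.
        return [('CREATE' if is_dir else 'FWRITE', path)]

    def end_of_loop(self):
        """Age the pending events; expired ones become DELETE records."""
        records = []
        for cookie in sorted(self._pending):
            entry = self._pending[cookie]
            entry[1] += 1
            if entry[1] > self.max_loops:
                records.append(('DELETE', entry[0]))
                del self._pending[cookie]
        return records
```

The Monitor would call moved_from()/moved_to() while it processes the events of one loop and end_of_loop() once per loop, appending whatever records come back to wmLog.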
The inotify Monitor runs an infinite loop dedicated to processing inotify events, which means that, unless select/poll is used, another infinite loop must run to listen on a socket port and handle the synchronization requests from clients. (So which is better? For now I use the child thread schedule.)
This is the responsibility of the schedule. It just waits for a few commands: START (a new server thread for the client), or TERMINATE (by and only by the main thread) ...
So far the schedule just runs as a thread, and only spawns child server threads when a client asks for synchronization. But as far as I know, because of the Python GIL, Python threads cannot make full use of multiple CPUs, so what about schedule processes?
Once the server thread has started, the client communicates with it via a TCP socket, so it is necessary to design a rational application protocol.
The current protocol for a client that asks for an init sync:
C: START
S: OK
C: INIT
S: SESSION:8c0a4927a5280905e6f3ca01d5a02a53
S: SERIAL:3279
S: /var/www/html/
S: /var/www/html/include/
S: /var/www/html/syssite/
... # Only dirs
S: EOF
S: /var/www/html/index.html
S: /var/www/html/index.php
S: /var/www/html/include/config.php
... # Only regular files
S: EOF
C: NEXT
# Server thread blocked until there is any modification ...
S: SN:3282
S: CREATE:/var/www/html/new
S: DELETE:/var/www/html/tmp/s2124.html
S: FWRITE:/var/www/html/files/s5237.html
S: EOF
C: NEXT
# Server blocked ...
...
If the client gives an invalid command, the server thread should report that and terminate:
C: THIS
S: INVALID COMMAND

C: START
S: OK
C: THIS
S: INVALID INIT COMMAND

C: START
S: OK
C: INIT
S: ...
...
C: NXT # Should be "NEXT"
S: INVALID REALTIME COMMAND
I think this protocol has some problems, and for the future smooth boot and single file system snapshot features, the following style seems better:
C: START\r\n
S: OK\r\n
C: INIT\r\n
S: SESSION:8c0a4927a5280905e6f3ca01d5a02a53\r\n
# Assume the upper limit is 3279 as the example above
S: SERIAL:1024\r\n
S: CREATE:/var/www/html/\r\n
S: CREATE:/var/www/html/include/\r\n
S: FWRITE:/var/www/html/include/config.php\r\n
S: FWRITE:/var/www/html/index.html\r\n
S: FWRITE:/var/www/html/index.php\r\n
S: CREATE:/var/www/html/syssite/\r\n
...
S: EOF\r\n
C: NEXT\r\n
S: SERIAL:2048\r\n
S: ...
...
S: EOF\r\n
C: NEXT\r\n
S: SERIAL:3072\r\n
...
C: NEXT\r\n
S: SERIAL:3279\r\n
...
C: NEXT\r\n
# Server thread blocked until there is any modification ...
# And client waits for the response ...
S: SERIAL:3282\r\n
S: CREATE:/var/www/html/new\r\n
S: DELETE:/var/www/html/tmp/s2124.html\r\n
S: FWRITE:/var/www/html/files/s5237.html\r\n
S: EOF\r\n
C: NEXT\r\n
# Server thread blocked and client waits ...
Here the protocol becomes line based, which makes it possible to communicate with the server via telnet, and the first init sync behaves like a normal wmLog transmission. Thus if mirrord restarts smoothly from the broken point and has computed the modifications between the 2 boots, it only transmits those modifications as normal wmLog records as above (but containing only CREATE/FWRITE/DELETE actions, since detecting MOVED actions is not so easy).
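One benefit of the line based style is that the client side parser becomes very small. The following sketch reads one batch (everything up to "EOF") from any iterator of lines, for example the file object returned by socket.makefile(). The function name `read_batch` and the returned structure are assumptions for illustration; only the SESSION, SERIAL, CREATE, DELETE, FWRITE and EOF lines shown in the transcript above are handled, since the format of a MOVE record is not shown there.

```python
ACTIONS = ('CREATE', 'DELETE', 'FWRITE')


def read_batch(lines):
    """Consume lines up to and including EOF; return (header, records)."""
    header = {}
    records = []
    for raw in lines:
        line = raw.rstrip('\r\n')
        if line == 'EOF':
            return header, records
        # Split at the first colon only, so pathnames may contain colons.
        key, _sep, value = line.partition(':')
        if key in ('SESSION', 'SERIAL'):
            header[key] = value
        elif key in ACTIONS:
            records.append((key, value))
        else:
            raise ValueError('unexpected line: %r' % line)
    raise ValueError('connection closed before EOF')
```

The client would send "NEXT\r\n", call read_batch(), apply the records, save header['SERIAL'], and repeat.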
I think it is necessary to limit the number of records transmitted each time, to avoid overloading the server, so here at most 1024 records are transmitted each time.
The following descriptions use this line based style to be more consistent, but the current version actually uses the previous non line based style; this will be changed in the next version.
If the client is terminated unexpectedly and wants to restart from the broken point, it gives the server the session and serial number, which the client itself has saved. The protocol is like this:
C: START\r\n
S: OK\r\n
C: SN:8c0a4927a5280905e6f3ca01d5a02a53, 3282\r\n
S: OK\r\n
C: NEXT\r\n
# Server thread blocked and client waits ...
It is more likely that the client only records the last transmitted serial number (3072), thus becomes this:
C: START\r\n
S: OK\r\n
C: SN:8c0a4927a5280905e6f3ca01d5a02a53, 3072\r\n
S: SERIAL:3282\r\n
S: FWRITE:/var/www/html/program.php\r\n
S: CREATE:/var/www/html/include/ext/\r\n
S: DELETE:/var/www/html/tmp/s3984.html\r\n
...
S: EOF\r\n
C: NEXT\r\n
# Server thread blocked and client waits ...
If the client gives an invalid session or serial number, the server requires the client to do the first init sync; the client then either sends the "INIT" command or terminates without synchronization.
C: START\r\n
S: OK\r\n
C: SN:60a50c8be3147f15fe57db3fb0216599, 3072\r\n
S: SESSION INVALID\r\n
C: INIT\r\n
...

C: START\r\n
S: OK\r\n
C: SN:8c0a4927a5280905e6f3ca01d5a02a53, -1\r\n
S: SERIAL OVERDUE OR INVALID\r\n
C: INIT\r\n
...

C: START\r\n
S: OK\r\n
C: SN:8c0a4927a5280905e6f3ca01d5a02a53, -1\r\n
S: SERIAL OVERDUE OR INVALID\r\n
C: NEXT\r\n
S: INVALID INIT COMMAND\r\n
# Terminate the server thread
I described the server thread locking for the first init sync in the previous story. This locking makes the changes to several variables safe and atomic, especially the variables shared between the main thread and the server threads, such as the servers Pool (a structure that stores the identities of the server thread instances), client_status, and the currently processed serial number.
So the shared memory is locked by smLock, which is an instance of threading.Lock(). When a server thread gets smLock, it changes shared.client_status to inform the main thread, gets the last processed serial number (shared.serial), and uses a trick to lock the file system snapshot as described in the previous story.
Trick? As we know, the first init sync is resource consuming and also time consuming. If the snapshot were locked completely, the main thread would be blocked too long and too many unprocessed inotify events would accumulate in the inotify queue. This may overflow the queue, while a very long queue consumes too much memory. The main thread and the other server threads might also have to wait too long doing nothing, and when they come back they would have to deal with too many exceptions such as the "vanished" problem and the "MOVED" problem. That defeats the original purpose of spreading the load over time and being as realtime as possible.
The trick is achieved with the concept of a pointer. At first the pointer _snap points to shared.snap; after smLock is acquired it is switched to shared._temp_snap, and smLock is released. After the first init sync for the client, smLock is acquired again, _snap is pointed back to shared.snap, shared.snap is updated from shared._temp_snap, and finally smLock is released. All the other parts of the application only use _snap. This is very easy to do in Python since Python names are "always" references. Please read the code for more details.
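The pointer exchange can be illustrated with plain dicts. This is a simplified sketch, not the mirrord code: the class `Shared` and its method names are invented here, the switch is done by whoever calls begin_init_sync()/end_init_sync() (in mirrord it is the main thread that switches the pointer), and only additions made during the init sync are merged back; deletions would need extra bookkeeping.

```python
import threading


class Shared(object):
    """Snapshot holder whose writers are redirected during an init sync."""

    def __init__(self):
        self.lock = threading.Lock()    # plays the role of smLock
        self.snap = {}                  # the real snapshot
        self._temp_snap = {}            # receives updates during init sync
        self._snap = self.snap          # the only name other code uses

    def record(self, path, value):
        # Writers never care which dict _snap currently refers to.
        with self.lock:
            self._snap[path] = value

    def begin_init_sync(self):
        # The lock is held only for the switch, not for the whole sync.
        with self.lock:
            self._temp_snap = {}
            self._snap = self._temp_snap
        return self.snap                # stable while the sync is running

    def end_init_sync(self):
        with self.lock:
            self.snap.update(self._temp_snap)
            self._temp_snap = {}
            self._snap = self.snap
```

While the long init sync iterates over the object returned by begin_init_sync(), record() keeps working at full speed against the temporary dict, so the inotify queue does not pile up.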
This trick requires that only one client does the first init sync at a given time, so a client_init_Lock (threading.Lock()) is used as a lock between server threads. To avoid deadlock, client_init_Lock and smLock should be organized this way:
class ServerThread:
    def __init_send(self):
        # acquire before "try", so "finally" never releases an unheld lock
        client_init_Lock.acquire()
        try:
            smLock.acquire()
            try:
                shared.client_status = CLIENT_INITING
                serial = shared.serial
                shared.svPool[self] = serial
            finally:
                smLock.release()
            ... # Do first init sync
            smLock.acquire()
            try:
                shared.client_status = CLIENT_INITTED
            finally:
                smLock.release()
        finally:
            client_init_Lock.release()
Here the pointer is not switched by the server thread; it is done by the main thread after acquiring smLock. The main thread knows when to switch the pointer by checking shared.client_status.
So far I use the pyinotify module to catch and process the inotify events. Maybe it is my lack of knowledge, but I found some important limitations in using pyinotify, for example:
| files num | dirs num | memory consumed |
|---|---|---|
| 2011967(150M) | 71333(4.1M) | ~70MB |
| 5951114(416M) | 227534(12M) | ~200MB |
Maybe it's necessary to write another pyinotify implementation to fit these requirements. I will try to do this.
The package that contains mirrord/fs_mirror is named cutils, which
is a subproject of ulfs. uLFS means "Your own customized and
easily manageable LFS (Linux From Scratch) distribution". It
contains a set of tools to achieve this goal, such as
User Based Package Management, Configuration Sharing
Management, and this file system backup and near realtime mirror
(maybe I will implement a truly realtime mirror solution in the
future).
To install mirrord on the server whose file system is going to be mirrored, Python 2.4 or higher is required, and pyinotify-0.7.1 is used for this version:
sh$ tar xfz Python-2.5.1.tar.gz
sh$ cd Python-2.5.1
sh$ ./configure --prefix=/usr
sh$ make
sh# make install
sh$ tar xfj pyinotify-0.7.1.tar.bz2
sh$ cd pyinotify-0.7.1
sh$ python setup.py build
sh# python setup.py install
On the client side, which runs fs_mirror, only Python is required; pyinotify is unnecessary.
Then install caxes-0.1.2, which is another subproject of ulfs. You
can download it from the same download page as cutils. It contains
some auxiliary data structures such as a Python Tree.
sh$ tar xfz caxes-0.1.2.tar.gz
sh$ cd caxes-0.1.2
sh$ python setup.py build
sh# python setup.py install
Finally, install cutils-0.1.1:
sh$ tar xfz cutils-0.1.1.tar.gz
sh$ cd cutils-0.1.1
sh$ python setup.py build
sh# python setup.py install --install-scripts=/usr/local/bin
Look at the graph in The Second Story:
to backup or mirror a file system, the first thing you have to do is
tell the backup/mirror programs what to backup/mirror. You run
fs_info to do so:
server# fs_info -a t:/var/www/html system
This adds "/var/www/html" to the backup/mirror include list of the file system part with the identity name "system". This identity is created when you first run this command; you can also use any name you want, such as "www", "web", "work", etc.
Only the top dirs are necessary, so:
server# fs_info -a t:/var/www/html \
    -a t:/var/www/html/include system
has the same net effect as the previous command ("/var/www/html").
To exclude some dirs/files from being backed up/mirrored, use the "x:" tag:
server# fs_info -a t:/var/www/html \
    -a x:/var/www/html/cache system
This adds "/var/www/html/cache" to the exclude list.
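The effect of the include and exclude lists can be sketched in a few lines. This is only an illustration of the rules described above, not the fs_info implementation: the function names `top_dirs` and `is_selected` are invented, and the rule that an exclude entry always wins over an include entry is an assumption.

```python
import posixpath


def _below(path, top):
    """True if path is top itself or lies somewhere under top."""
    top = posixpath.normpath(top)
    return path == top or path.startswith(top.rstrip('/') + '/')


def top_dirs(paths):
    """Drop every path that lies below another path of the list."""
    kept = []
    for path in sorted(posixpath.normpath(p) for p in paths):
        if not any(_below(path, k) for k in kept):
            kept.append(path)
    return kept


def is_selected(path, included, excluded):
    """True if path is under an included top dir and not under an excluded one."""
    path = posixpath.normpath(path)
    if any(_below(path, x) for x in excluded):
        return False
    return any(_below(path, t) for t in included)
```

For example, top_dirs(['/var/www/html/include', '/var/www/html']) keeps only '/var/www/html', which is why adding both has the same net effect as adding the top dir alone.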
If you already have lists in files, you can add them directly:
server# fs_info -a tL:/tmp/included_01 \
    -a tL:/tmp/included_02 \
    -a xL:/tmp/excluded_01 system
What fs_info actually does is append the items to $datadir/$identity/.{t,x}_files, so you can also edit those files directly, or use shell redirection:
server# find /var -maxdepth 1 >$datadir/$identity/.t_files
but fs_info does several validation checks for you, for instance the existence of the dirs/files. The choice is in your hands.
$datadir defaults to "/var/fs_info". You can change it to "/var/mirrord" by editing the configuration file "/etc/python/fs_info_config.py", to fit the needs of mirrord, whose own default is "/var/mirrord" and can be changed by editing the configuration file "/etc/python/mirrord_config.py".
These configuration files directly use the Python Tree data structure
defined in the caxes package.
$identity defaults to "system", as mentioned above.
These configurations can also be changed by command line options, for example:
server# fs_info -o datadir=/var/mirrord -a t:/var/www/html
Notice: the default values of the inotify kernel parameters are too small:
16384 /proc/sys/fs/inotify/max_queued_events
8192 /proc/sys/fs/inotify/max_user_watches
The value of "max_user_watches" depends on the total number of dirs in the file system part. You can use:
sh# find $path -type d | wc -l
to count them, and make sure the value is sufficiently larger than that.
"max_queued_events" is the maximum length of the queue managed by inotify; the more frequently the file system changes, the bigger this value should be. If you see a message like "** Event Queue Overflow **", "max_queued_events" is too small, and from that point on the monitoring is no longer exact and is problematic; you should restart "mirrord".
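The same check can be scripted. The sketch below counts the directories in the way the find command above does and reads a limit from /proc; the function names `count_dirs` and `read_limit` are invented for this example, and the `base` parameter exists only so that the reader can be pointed at another directory.

```python
import os


def count_dirs(top):
    """Count top and every directory below it (symlinks are not followed)."""
    return sum(1 for _root, _dirs, _files in os.walk(top))


def read_limit(name, base='/proc/sys/fs/inotify'):
    """Read one inotify kernel parameter, e.g. 'max_user_watches'."""
    with open(os.path.join(base, name)) as f:
        return int(f.read().strip())
```

A start-up check could then compare count_dirs('/var/www/html') with read_limit('max_user_watches') and refuse to start, or print a warning, when the limit is not comfortably larger.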
After these settings, it is time to boot mirrord to monitor and record the modifications of the original file system:
server# mirrord
# OR verbose:
server# mirrord -v
You can always adjust the behavior of mirrord by changing the configuration file "/etc/python/mirrord_config.py" or with command line options. Examples are the length of the wmLog (o.loglen: the longer it is, the more records are kept, which lets a client restart from the break point after a longer outage, but reading/updating the wmLog may consume more resources and time), the bind interface and port, the timeout of the socket used to communicate with clients, and the collection of identities of the file system parts that you want to be mirrored.
Wait for a while (depending on the hardware, the file system type you chose and the number of dirs/files you want to back up/mirror). When the "boot init finished" message is printed, you can switch to the client and start fs_mirror.
A future version may eliminate this waiting stage and let you start the fs_mirror client immediately after mirrord is invoked.
Before starting fs_mirror, make sure that the client can communicate with mirrord and transport regular files correctly. You must therefore open the default port 2123 for mirrord (or choose another one), and the port of the protocol you choose for regular file transport. As described in the picture of The Second Story and in The Fifth Story: Monitor, the design of mirrord/fs_mirror will support many file transport mechanisms, such as FTP, NFS, SMB/CIFS, etc. But so far there are only two mechanisms: REPORT and FTP.
"REPORT" just prints the actions in the wmLog received by fs_mirror. Although it creates no dirs and transports no regular files, it is still a useful feature; more about it later.
The "FTP" sync mechanism creates dirs and deletes or moves dirs/files locally, while only the writing of regular files leads to an actual file transport via the FTP protocol; this reduces the resource consumption on the server side. Notice that the whole file is transported, not only the modified part, so it is NOT a byte level transport. It is therefore not suitable for files that are too big or change too frequently, or say, for which the product of size and frequency is too large, or the value of "size / seconds" is too large, where seconds means the interval since the file was first transported. A future version of fs_mirror will contain such a computation and determine whether to do the actual file synchronization.
These synchronization mechanisms are defined in cutils.fs_sync module.
It is a good idea to create a user named "mirror" or "backup" that has read permission on the file system parts you want to mirror (this can be done with Linux ACLs), and then let this user access the system via FTP; maybe you want to use FTP over SSL. So far no authentication mechanism has been implemented.
Edit the configuration file on the client to configure fs_mirror. Read the example "etc/fs_mirror_config.py" for more details.
Then simply invoke the fs_mirror client:
client# fs_mirror
# OR verbose:
client# fs_mirror -v
"fs_mirror --help" gives you more information on the command line options.
As an example, you may want to change options like this:
client# fs_mirror -v -o host=roc \
    -o fssync.user=root \
    -o fssync.port=2121 \
    --password-from=/root/.fs_mirror/secrets:roc \
    --datadir=/data/fs_mirror/roc
Now the mirrored dirs/files are put into /data/fs_mirror/$hostname, and fs_mirror reads the password from a secrets file using $hostname as the id. The options specified by "-o" (--option) have the same meaning as the o.* options in /etc/python/fs_mirror_config.py described above. Here the hostname and 2 file synchronization parameters have been changed; of course, make sure the hostname can be resolved.
If verbose is turned on, you may get these messages on your screen:
Daemon PID 7094
Connecting to the Mirrord: ('roc', 2123) ...
Conected.
Server reply 'OK' for 'START'
Server reply 'OK' for 'SN:b5e0ce480c925184dbdb6f23a62ddc6d,872302'
Received next serial message: 'SERIAL:862303'
WRITE FILE '/var/Counter/data/5597.dat', PREFIX '/data/fs_mirror/roc'
Received next serial message: 'SERIAL:862306'
CREATE DIR '/var/www/html/sample231.com/syssite', PREFIX 'data/fs_mirror/roc'
WRITE FILE '/var/www/html/sample231.com/sysstie/rec.php', PREFIX '/data/fs_mirror/roc'
DELETE DIR/FILE '/var/www/html/sample231.com/sysstie/rec.php.old', PREFIX '/data/fs_mirror/roc'
...
This procedure may take a long time (just as the first init of RAID1 may take long), so you should avoid rebooting mirrord in this version. One of the future improvements is to make the reboot of mirrord smoother by computing the modifications made while it was stopped, or by just ignoring them with a parameter, which consumes fewer resources.
There is a regular file "$datadir/sn" that just records the session and the serial number:
sh# cat /data/fs_mirror/roc/sn
b5e0ce480c925184dbdb6f23a62ddc6d,872302
It is used when fs_mirror tries to restart from the broken point. It also serves as a file lock (fcntl.LOCK_EX) to prevent you from mirroring two remote hosts' file systems into one directory, or mirroring a remote file system twice!
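How such a combined "state file and lock" can work is sketched below. It is an assumption that fcntl.flock() is the call used (fcntl.lockf() takes the same LOCK_EX constant), and the function name `open_sn` is invented; the file format is the "session,serial" line shown above.

```python
import fcntl


def open_sn(path):
    """Lock the sn file exclusively and return (file, session, serial).

    The returned file object must stay open for as long as the mirror
    runs, because closing it releases the lock.
    """
    f = open(path, 'r+')
    try:
        # LOCK_NB: fail at once instead of waiting if another fs_mirror
        # already holds the lock on this data directory.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        f.close()
        raise
    session, serial = f.read().strip().split(',')
    return f, session, int(serial)
```

A second fs_mirror started with the same --datadir would fail in open_sn() and could exit with a clear error message instead of corrupting the mirror.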
As described in the Overview section, a rotate mechanism is necessary. So far fs_mirror has no builtin rotate implementation; that will only come in a future version, but you can write scripts to do it. There are two example scripts in the "prototypes" subdir that you can take as a reference: prototypes/fs_mirror_rotate.py and prototypes/mirror_rotate. The 2 scripts look like this:
prototypes/mirror_rotate:
#!/bin/sh
hosts="p01 2121
p02 2121
p03 2121"
cd `dirname $0`
# for host in hosts; do
echo "$hosts" | while read host port; do
    pid=`ps aux | grep "fs_mirror.*$host" | grep -v 'grep' | awk '{print $2}'`
    if [ $? -eq 0 ]; then
        if [ -n "$pid" ]; then kill $pid; fi \
        && ./fs_mirror_rotate.py $host \
        && /usr/local/bin/fs_mirror -v \
            -o host=$host \
            -o fssync.user=root \
            -o fssync.port=$port \
            --passwd-from=/root/.mirror/secrets:$host \
            --datadir=/data/fs_mirror/$host/
    fi \
    && cat /data/fs_mirror/$host/www/prima/usermap \
    | awk -F, '{printf("%s %s %s %s %s\n", $1, $2, $3, $4, $5)}' \
    | while read perm uid user gid site; do
        chown $uid.$gid /data/fs_mirror/$host/www/users/$site -R
        # gid always be ftpd
    done
done
prototypes/fs_mirror_rotate.py:
#!/usr/bin/python
# -*- encoding: utf-8 -*
import os,sys,shutil
import time
import datetime

mirror = "/data/fs_mirror"
backdir = "/data/hosts"

try:
    host = sys.argv[1]
except IndexError:
    print >> sys.stderr, "Lack of a host identity"
    sys.exit(1)

day = datetime.date(*time.localtime()[:3])
new = os.path.normpath("%s/%s" % (mirror, host))
NL = len(new)
old = os.path.normpath("%s/%s/%s" % (backdir, host, str(day)))
OL = len(old)

shutil.move(new, old)
for root, dirs, files in os.walk(old):
    d_new = os.path.normpath("%s/%s" % (new, root[OL:]))
    os.mkdir(d_new)
    # status = os.lstat(root)
    # perm = status[0]
    # os.chmod(d_new, perm)
    # uid = status[4]
    # gid = status[5]
    # os.chown(d_new, uid, gid)
    try:
        os.chmod(d_new, 0755)
    except OSError:
        pass
    # print "CREATE DIR '%s'" % d_new
    for fname in files:
        f_new = os.path.normpath("%s/%s" % (d_new, fname))
        f_old = os.path.normpath("%s/%s" % (root, fname))
        os.link(f_old, f_new)
        # Hard link will reserve the permission and ownership of a file
        try:
            os.chmod(f_new, 0644)
        except OSError:
            pass
        # print "HARD LINK '%s' -> '%s'" % (f_old, f_new)

interval = datetime.timedelta(days=14)
overdue_day = day - interval
overdue_dir = os.path.normpath("%s/%s/%s" % (backdir, host, str(overdue_day)))
try:
    shutil.rmtree(overdue_dir)
except OSError, (errno, strerr):
    if errno == 2:
        print strerr
    else:
        raise
The rotated archives look like this, for example:
[root@stor p01]# ls -l
total 144
drwxr-xr-x+ 6 root root 4096 Oct  2 03:10 2007-10-03
drwxr-xr-x+ 6 root root 4096 Oct  3 03:10 2007-10-04
drwxr-xr-x+ 6 root root 4096 Oct  4 03:10 2007-10-05
drwxr-xr-x+ 6 root root 4096 Oct  5 03:10 2007-10-06
drwxr-xr-x+ 6 root root 4096 Oct  6 03:10 2007-10-07
drwxr-xr-x+ 6 root root 4096 Oct  7 03:10 2007-10-08
drwxr-xr-x+ 6 root root 4096 Oct  8 03:10 2007-10-09
drwxr-xr-x+ 6 root root 4096 Oct  9 03:10 2007-10-10
drwxr-xr-x+ 6 root root 4096 Oct 10 03:10 2007-10-11
drwxr-xr-x+ 6 root root 4096 Oct 11 03:10 2007-10-12
drwxr-xr-x+ 6 root root 4096 Oct 12 03:10 2007-10-13
drwxr-xr-x+ 6 root root 4096 Oct 13 03:10 2007-10-14
drwxr-xr-x+ 6 root root 4096 Oct 14 03:10 2007-10-15
drwxr-xr-x+ 6 root root 4096 Oct 15 03:10 2007-10-16
[root@stor p01]# ls 2007-10-16/etc/httpd/conf.d -l
total 112
-rw-r--r-- 9 root root    58 Oct  8 13:41 bw_mod.conf
-rw-r--r-- 9 root root   187 Oct  8 13:41 mod_caucho.conf
-rw-r--r-- 9 root root  2965 Oct  8 13:41 site.conf
-rw-r--r-- 9 root root 10919 Oct  8 13:41 ssl.conf
-rw-r--r-- 9 root root 85876 Oct  8 13:41 virtual.conf
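The hard link count is 9: each file is shared by the eight archives from 2007-10-09 to 2007-10-16 plus the live mirror, which means the last time mirrord was booted was 2007-10-08.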
Notice that fs_sync unlinks a regular file before transporting it, to avoid polluting the rotated backups: all the rotate script actually does is create directories and make hard links to the regular files, so writing into an existing file in place would change every archive that shares its inode.
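The effect of unlinking first can be shown with a small sketch. It is not the fs_sync implementation; replace_file is a hypothetical helper that only demonstrates the technique of detaching a name from a shared inode before writing new content.

```python
import errno
import os


def replace_file(path, data):
    """Write data to path without touching other hard links to the old inode.

    Hypothetical helper, shown only to illustrate the unlink-first approach.
    """
    try:
        # Remove this name only; other hard links keep the old content.
        os.unlink(path)
    except OSError as exc:
        if exc.errno != errno.ENOENT:
            raise
    # The open() call now creates a brand new inode for path.
    with open(path, "wb") as fobj:
        fobj.write(data)
```

After such a call the file in the live mirror has a link count of 1 again, while the copies in the rotated archives still hold the previous content.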
So far, only the modification actions and the corresponding pathnames are transmitted. If the mirrored dirs/files have ownership/permission requirements, the current mirrord/fs_mirror cannot help by itself, because changing ownerships and permissions requires a traversal of the whole file system, which is very resource and time consuming. So I modified the rotate script to change the ownerships and permissions daily. This is not a good solution, but a more powerful mechanism can only come in a future version.
Since ownerships and permissions must be kept consistent across hosts, it is sensible to use a centralized authentication solution such as NIS, LDAP or Kerberos.
I must say that so far the fs_mirror client is not fully tested, because of my limited experience in design and programming and my limited time (I usually have to squeeze it out of my spare time to push the project forward), so you may have to spend some more time adjusting fs_mirror to make it work correctly.
In the rsync part of the examples in the Overview section, backup for high availability was described. With mirrord/fs_mirror, just replace "rsync" with "mirrord", like this:
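The daily ownership fix could also be written in Python instead of shell. The following is a rough sketch under assumptions: the record layout "perm,uid,user,gid,site" is inferred from the awk call in prototypes/mirror_rotate, numeric uid/gid values are assumed, and the function names parse_usermap and chown_tree are hypothetical.

```python
import os


def parse_usermap(lines):
    """Yield (site, uid, gid) from "perm,uid,user,gid,site" records.

    The field order is an assumption taken from prototypes/mirror_rotate.
    """
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 5:
            continue  # skip blank or malformed records
        perm, uid, user, gid, site = fields[:5]
        yield site, int(uid), int(gid)


def chown_tree(top, uid, gid):
    """Change the owner of top and of everything below it."""
    os.lchown(top, uid, gid)
    for root, dirs, files in os.walk(top):
        for name in dirs + files:
            # lchown changes a symlink itself, never the file it points to.
            os.lchown(os.path.join(root, name), uid, gid)
```

A rotate script would call chown_tree once per record, which still walks every user tree and is therefore as expensive as the paragraph above says.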
This graph should be displayed with monospaced fonts:
+----------+
|  worker  | -[mirrord] -----------\
+----------+                       |
   ......                          |
                                   |
+----------+                       |
|  worker  | -[mirrord] -----------\
+----------+                       |
                                   V
                              [fs_mirror]
                                   |
+----------+                 +----------+
|  worker  | -[mirrord] ---> |  backup  |
+----------+                 +----------+
     |                             |
[take_over]                        |
     |                             |
     V                             |
+----------+                       |
|  rescue  | <------------------- NFS
+----------+
This is a many-to-one backup, which is cost efficient. If one of the
worker hosts fails, you can substitute the rescue host for the failed
worker, with the aid of any high availability method, such as the
heartbeat project of http://www.linux-ha.org/.
To make the "fs_mirror" on the backup host mirror the file systems of several worker hosts, you should now run fs_mirror once per host, like this:
client# fs_mirror -v -o host=worker01 -o fssync.user=mirror -o fssync.port=21 \
    --password-from=/root/.fs_mirror/secrets:worker01 --datadir=/data/fs_mirror/worker01/
client# fs_mirror -v -o host=worker02 -o fssync.user=mirror -o fssync.port=21 \
    --password-from=/root/.fs_mirror/secrets:worker02 --datadir=/data/fs_mirror/worker02/
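With more than a few workers it may be handy to generate these command lines. The sketch below only builds argument lists from the options shown in the commands above; the function name fs_mirror_argv and its default values are hypothetical, and nothing is executed.

```python
def fs_mirror_argv(host, port, user="mirror",
                   secrets="/root/.fs_mirror/secrets",
                   data_root="/data/fs_mirror"):
    """Return the fs_mirror command line for one worker host as a list."""
    return [
        "fs_mirror", "-v",
        "-o", "host=%s" % host,
        "-o", "fssync.user=%s" % user,
        "-o", "fssync.port=%d" % port,
        # One secrets file, with the host name as the id of the entry.
        "--password-from=%s:%s" % (secrets, host),
        # One data directory per host, so mirrors never share a directory.
        "--datadir=%s/%s/" % (data_root, host),
    ]


if __name__ == "__main__":
    for worker in ("worker01", "worker02"):
        print(" ".join(fs_mirror_argv(worker, 21)))
```

Each list can be passed to subprocess.call as it is, one fs_mirror process per worker.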
Make sure the hostnames can be resolved and that the user "mirror" exists on all the server hosts with read permission on the whole file system to be mirrored (achieving this with Linux ACLs is recommended).
As described in the previous subsection, in this way you can put the mirrored file system contents of different hosts into different directories, and then, when taking over, make the appropriate symlinks to the corresponding subdirs of the mirrored part.
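The symlink step of a take-over might look like the sketch below. It is only an assumption about how such a step could be scripted: link_mirrored_subdirs is a hypothetical helper, and which subdirs to link depends entirely on the services the failed worker was running.

```python
import os


def link_mirrored_subdirs(datadir, subdirs, root="/"):
    """Symlink each mirrored subdir of datadir to the same path under root.

    Returns the list of symlinks created. Hypothetical helper.
    """
    created = []
    for sub in subdirs:
        rel = sub.strip("/")
        source = os.path.join(datadir, rel)
        target = os.path.join(root, rel)
        parent = os.path.dirname(target)
        if not os.path.isdir(parent):
            os.makedirs(parent)
        # os.symlink raises OSError if target already exists, so an
        # existing directory on the rescue host is never replaced silently.
        os.symlink(source, target)
        created.append(target)
    return created
```

For example, link_mirrored_subdirs("/data/fs_mirror/worker01", ["/var/www/html"]) would make /var/www/html on the rescue host point into the mirror of worker01.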
An unexpected but useful application of mirrord/fs_mirror is to turn them
into an IDS (Intrusion Detection System) with the "REPORT" fs_sync
mechanism, which can be considered a realtime tripwire. On the server
side, put the parts of the system that do not change frequently under
the monitoring of mirrord (inotify), such as all the executable system
commands and the configuration files. On the client side, run fs_mirror
in "REPORT" mode; then any critical modification on the server is
reported in near "realtime".
It seems a good idea to implement a GUI application for the IDS client side, so that it can run as a daemon on your PCs, for example on a Windows system, and report critical changes on the servers quickly.
So far this usage has not been thought through thoroughly, and some important features such as SSL support have not been implemented, but I am sure it can be used in this way.
Flaboy #王磊: he was the first person to propose using FAM on Linux for near realtime file system mirroring, and he conceived the first story. Following his suggestion, I searched the Internet for a whole evening and found inotify at last.
Alex #老徐(徐唤春): he suggested that I use a log mechanism to record the modifications of the file system, taking MySQL's replication as a reference. We also discussed some aspects of the protocol to be used between mirrord and fs_mirror.
Sebastien Martini: I use his pyinotify module, and his feedback is very valuable to me.