Home Home > 2008
Sign up | Login

Archive for 2008

Fate Internal, Up- and Downstream

December 16th, 2008 by

motivated by Aaron’s blog post More downstream fun I was thinking about how Fate could be a more important part of infrastructure in the Linux landscape. Fate is now an important part of the Novell/SUSE infrastructure and we are currently in the process to open it up for the openSUSE community. But could Fate also be useful for upstream integration? To let you participate in the discussion I think I should start with some explanations what Fate is and in which environment we are with it.

Fate is a system developed at SUSE over the last few years to track features and requirements for Novell Linux products.  The term “feature” is already is a topic for scientific papers, but how we understand a feature is a functionality  that is not yet in the product but required or wanted. It references future products, in most cases more than one such as SLE and openSUSE.

Fate Feature Tracking EnvironmentThe little sketch illustrates the dilemma in which we are when it comes to product planning. Basically it is all about one  thing: decision taking. Decisions have to be taken about the new functionality  that goes into a product and the tasks internal people work on. This is based on the decision how the product should look alike from a high level point of view. To make a solid decision about the high level product it needs to be clear what we are actually able to put into the product at a given time. That is only a part of what is really going on but let’s leave it with that for simplicity.

You see that lots of the base information which is needed to make
good decisions comes from different people: Product managers have a strong idea of how products should be, the technical project manager knows about dates and technical possibilities and can plan with the engineering managers how that can be achieved with the given amount of people in a given time frame. Technical feasibility is worked out with the developers as they’re the experts. The colored arrows try to visualize the communication ways, different colors mark different topics.

Since we work with the communities we have more input of information: The user community tells what is needed and the upstream communities announce what they plan to do by when.

The part with the internal decision taking is very much based on Fate in the Linux part of Novell and that is working fine. Features come in by a requester and all involved parts can give their thoughts in  a discussion forum. The key functionalities add their priority for the feature and finally PM and TPM come to a decision. Features with a high priority have to make it into the product. Engineering managers can assign developers and they can mark features as finished. All the processes are covered by a set of configurable rules. The Fate system is integrated into other infrastructure parts. There are several clients for different needs, the most mature is the KDE fat client.

What is not yet optimal are two things: There is no good way yet for the user community to community their wishes for upcoming products. We are facing that with opening up a new web based openSUSE Fate client soon. That will involve the user community not only in testing and using the product but already in its planing in a defined way.

A more tricky part is how to involve the upstream communities. It would be great for a Product Manager of a Linux distribution to see the feature plans for upcoming releases of big upstream projects, maybe somehow integrated into the Fate for his product. Would the Fate model as described here in a nutshell suitable for upstream projects?

For example, could KDE or GNOME make use of parts of the process for them internally and provide a structured interface to the downstream parties? If so, that could add a lot of transparency. Transparency is the precondition for flexibility and trust and as a result for better collaboration which would benefit all.

I hope that helps for a basic understanding and would love to hear your opinion. I promise to come up with more information about the Fate system and improve my drawing skills 😉

openSUSE Mirrorlist Improved

December 16th, 2008 by

Peter Pöml has created a dynamic list of our mirrors with accurate information about which repositories they mirror.  You can reach the link via the openSUSE wiki.  The data in the mirror list is used by our download redirector download.opensuse.org to point users automatically to mirrors close to them.

Btw. if anybody else wants to mirror openSUSE, please read these Instructions about it.  We love to increase our mirror infrastructure and there are some regions that could use additional mirrors.  Even if you only mirror parts of the openSUSE tree, you’re welcome!

Thanks a lot to all the mirror admins that mirror openSUSE – and to Peter for his continous improvements to the infrastructure!

OpenOffice_org 3.0.1 beta1 available

December 15th, 2008 by

I’m happy to announce that OpenOffice.org 3.0.1 beta1 packages are available in the Build Service OpenOffice:org:UNSTABLE project.

The packages are still beta versions and might include even serious bugs. Therefore they are not intended for data-critical usage. A good practice is to archive any important data before an use, …

We kindly ask any interested beta testers to try the package and report bugs.

Other information and plans:

I have also updated the packages in the <a href=”http://en.opensuse.org/Additional_YaST_Package_Repositories#OpenOffice.org_STABLE”>OpenOffice:org:STABLE</a> project to the version <em>3.0.0.3.6</em>. It fixes some annoying problems that were included in the <em>OOo-3.0 final</em> version. It would have made sense to put there the <em>3.0.0.9</em> version from <em>openSUSE-11.1</em> but it was not tested on older distributions in the <em>UNSTBALE</em> project, so I rather used the more conservative <em>3.0.0.3.x</em> branch. In each case, it will get replaced with <em>OOo-3.0.1</em> final version once it is ready. It should happen in the middle of January.

The <a href=”http://en.opensuse.org/Additional_YaST_Package_Repositories#OpenOffice.org_UNSTABLE”>OpenOffice:org:UNSTABLE</a> project is currently used for the <em>OOo-3.0.1</em> beta testing. I’ll put there <em>OOo-3.1-alpha </em>after we move the <em>OOo-3.0.1</em> final version to the <em>STABLE</em> project.

I am sorry that the <em>OOo-3.0.0.x</em> versions were not available for older distribution in the <em>UNSTABLE</em> project. I was very overloaded last weeks/months and did not have time to update the Build Service project.

I do not plan any other updates in December. I am going to take some time off to be fresh and ready to prepare new builds for you in 2009.

Merry Christmas and Happy New Year!

(more…)

Longstanding Wiki Problem Fixed.

December 15th, 2008 by

Today, I fixed a long-standing, serious problem in the openSUSE wiki:

15-20 seconds long hangs that one would see when loading pages. It occured only when one was logged in, and it happened (for me) on every second access. So one page load hang, the next would work, the next would hang again, and so on. Eventually each page would load and show correctly, but after an initial waiting time of nearly 20 seconds.

I experienced this on Friday, and was not happy about it, because we are preparing for an openSUSE release. I fixed it today.

What follows is the complete story of how I debugged it. It was quite complicated, took hours, and can hopefully serve as an educational example of the diagnostic thinking and different debugging methods. I always planned to write up things like this 🙂

So at first, I made sure that the hang is not in my browser, or my network. I could reproduce it from a different machine in a different browser. Next, I tried to reproduce it from an Internet server on the commandline. After grabbing the exact Cookie headers from Firefox (with the Firebug extension), it was possible. (Note: all sensitive data in the following logs, like cookies, is garbled.)

date; curl -sv -H 'Cookie: opensuse_enUserName=Poeml; bblastvisit=1227878957; bblastactivity=0; s_sq=%5D%5D; s_cc=true; sessionhash=392aee91a9d014da0d7; opensuse_en=6eb4a0c94855; opensuse_de=3jtsao6a6jr8mrfm1; ZNP00-opensuse=09a41c; opensuse_deUserName=Poeml; opensuse_enLoggedOut=20081212192257; opensuse_deLoggedOut=20081212213443; IPC8ef153=e3c5b0f; opensuse_deUserID=87; cmTPSet=Y; cm=u2=1104985&pi=User%3APoeml; opensuse_enUserID=32' http://en.opensuse.org/User:Poeml; date
Next was to find out where the hang occurs. The normal way that the request should take is from my client to the iChain servers that are in front of the openSUSE wiki servers, and they should pass it on to one of the servers. (The iChain proxy acts as a reverse proxy, and sends the request to the backend server(s) as if it was the client itself.)

Fortunately, the setup allowed to narrow down the affected webserver to a single one, due to the way that the proxy distributes the requests to the backends.

After configuring iChain to add X-Forwarded-For with the clients real IP address, and configuring Apache to log this header, my hanging request could clearly be seen in the Apache access log. Apache is configured to log an additional client header like this by adding %{x-forwarded-for}i to the log lin template, that’s defined with a LogFormat directive (in /etc/apache2/mod_log_config.conf at our Apache). I just appended it to the end of the line:

LogFormat "%h %l %u %t \"%r\" %>s %b \
\"%{Referer}i\" \"%{User-Agent}i\" %{x-forwarded-for}i"         combined

Now, the next step was to reproduce the hanging request on the local machine. This in order to rule out that the request might be “stuck” in the proxy, before it actually reaches Apache. However, when I did the request from locally (on the web server), it didn’t cause the hang! But a request that is passed through the iChain proxy looks different; it gets additional session state cookies. So it seems like a good idea to capture the actual request headers that Apache sees.

This can be done with Apache’s mod_log_forensic module. This module was written to be able to debug crashes in Apache that are caused by specific requests. Apache generally logs after it completed processing a request. So a fatal request would normally never be logged (if it crashes the Apache process, the parent would log the segmentation fault, but the request cannot be logged anymore). mod_log_forensic logs each request before Apache starts working on it, and adds a unique id to the logfile; and when the request is done, the module logs another line, just with the unique id. So after a crash, one could check with the logfile and find requests that weren’t fully handled. A single directive is needed: ForensicLog /var/log/apache2/forensic_log. (A script that analyzes such a log file is provided as /usr/bin/check_forensic2.)

Incidentally, the module logs all request headers, so it is simple to capture the headers this way that I was interested in:

+6a73:4942cb0e:0|GET / HTTP/1.1|User-Agent:curl/7.15.1 (x86_64-suse-linux) libcurl/7.15.1 OpenSSL/0.9.8a zlib/1.2.3 libidn/0.6.0|Host:de.opensuse.org|Accept:*/*|Connection:close, TE|X-Forwarded-For:121.11.11.121|Cookie:opensuse_enUserName=Poeml; bblastvisit=1227878957; bblastactivity=0; rr_t=%25255B55D%25255D; s_cc=true; sessionhash=52aee91a9d014da0d7; opensuse_en=576eb4a0c94855; opensuse_de=ao6a6jr8mrfm1; opensuse_deUserID=87; opensuse_deUserName=Poeml; opensuse_enLoggedOut=20081212192257|TE:chunked, identity, deflate|Via:1.1 ICS_SERVER (iChain 2.3.416)|X-username:poeml|X-email:poeml@novell.com
-6a73:4942cb0e:0

With those headers, the problem was indeed reproducible locally on the machine. Now, it could be taken off the network (so it doesn’t get any live requests more from users), and I could start to look closer at Apache. Apache uses mod_php, which is executing PHP scripts that implement the wiki functionality, which talk to a backend database, and also talk to a memcache daemon which is used to cache user data and page results, to serve them faster. As one could easily tell with netstat, Apache was talking to both of them.

To look at the timeline of these things happening, strace is useful to trace the system calls that are done. The question, though, is which Apache process to trace?

Normally, “ps axO ppid,wchan | grep httpd” can be used to determine the process which is going to get the next request:


# ps axO ppid,wchan | grep httpd
3614 1 select S ? 00:01:44 httpd2-worker
10921 3614 pipe_w S ? 00:00:12 httpd2-worker
22615 3614 pipe_w S ? 00:00:01 httpd2-worker
23860 3614 skb_re S ? 00:00:27 httpd2-worker
23861 3614 select S ? 00:00:00 httpd2-worker

The child (not parent) which is in select() is the one. However, the above example is from another machine; it turned out that on x86_64, the calls are not showing correctly by “ps”. Thus, I needed to start Apache in single process mode, which avoids the ambiguity. So, we use strace with timestamps (-tt) to start Apache in single process mode:


strace -f -s 1024 -o httpd2.strace -tt /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf -DONE_PROCESS -DNO_DETACH &

The trace showed a lot of normal things happening, and at one point during the request processing it starts showing successive leaps of 1 second. The first one is this one:

4059  06:47:54.279227 setsockopt(19, SOL_SOCKET, SO_RCVTIMEO, "\2003\341\1", 16) = 0
4059  06:47:54.280772 sendto(18, "set opensuse_en:pcache:idhash:7452-0!1!0!ISO 8601!0!en!0 1 86400 1183\r\nO:12:\"ParserOutput\":7:{s:5:\"mText\";s:961:\"gt;gt;gt;\ngt;gt;gt;\nYou see me working here\ngt;gt;gt;\nHello!\ngt;gt;I started to work for SuSE in 2000. For some years, I maintained Apache, OpenSSL, DHCP and other packages.\ngt;gt;Today, I maintain the openSUSE gt;download infrastructuregt; and gt;mirror ing frameworkgt;. (Contact address for thay is ftpadmin at suse.de.) Currently I work on metalink support.\ngt;gt;I also work on the gt;Build Servicegt;. I star ted to write gt;oscgt; in the earlier days.\ngt;gt;You\'ll find me on freenode as \"mirrorbrain\" (previously: DuDE)\ngt;\n<!-- Saved in parser cache wi th key opensuse_en:pcache:idhash:7452-0!1!0"..., 1256, MSG_DONTWAIT, NULL, 0) = 1256
4059  06:47:54.281212 poll([{fd=18, events=POLLIN|POLLERR|POLLHUP}], 1, 1000) = 0
4059  06:47:55.285308 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 0) = 0

Some explanations about this. File descriptor 18 is the socket to the memcache daemon, 19 is to mysql. We see a write to memcached, and two poll() calls. The first one is polling on the memcached fd and it’s called with a timeout of 1000 ms. The = 0 means that the call returns 0, and 0 from poll() means that it did run into the timeout. Thus, here’s our one-second-leap. strace logs the timestamps at the start of each call, and thus the following poll happens 1000ms after the other. I was not entirely sure, but I verified it with a tiny test program:

int main() { sleep(1); return 0; }

Thus, PHP is waiting to read from the memcache daemon, and doesn’t get anything to read.

This could happen if the length that PHP intends to write into the memcache is calculated wrongly; http://code.sixapart.com/svn/memcached/trunk/server/doc/protocol.txt explains the protocol, and such a (still hypothetical) issue is easily reproducible when speaking to the memcache daemon:


# telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
set foo 1 60 2 <-- I'm saying '2' bytes will be sent
x <-- I sent only one

Here, both ends will hang and wait forever.

However, this was just a theory, as I was not yet sure if I read the strace 100% correctly, and don’t think on false assumption.

Now, apart from the fact that it was Sunday afternoon, and I had some time to continue debugging it, something unexpected happened:

When I coincidentally logged into the wiki, I got the hang there again – from the live server (other box). Thus, the other server(s) were clearly also affected. But worse, my test case was gone — I could no longer reproduce the issue locally. Oh well! This was unfortunate, because I was just about to capture a better strace, which captures full length, to show the complete and exact memcached communication. I used 8192 bytes of string capture length and got a trace from that. However, 8192 bytes were not enough in the interesting place, and I intended to increase it more, but that was the moment when the problem wasn’t reproducible anymore.

Now, it seemed highly important to fix this, because it seemed to affect all servers; and if it happens during high load in multiple cases then any of the machines might explode pretty soon, because the time that an Apache process handles a requests this way is extremely long, and they accumulate…

So I switched to continue debugging on the live server, this time in a non-intrusive fashion. Still trying to find out whether the communication with mysql or memcached was at fault.

The gathered data seemed consistent so far, but to rule out false assumption, I
wanted to double check it with tcpdump. I took a tcpdump, and dumped the
protocol to align it with the timestamps:


tcpdump -s 8192 -i lo -w poeml-memcached-tcpdump-broken port 11211 &
# another test request
curl ...
# stop the tcpdump
kill %tcp
# look at the trace
tcpdump -X -r poeml-memcached-tcpdump-broken |vi -

Below is an dump that doesn’t show the full packets (-X), but I annotated it with the protocol chatter:

(You need a very wide browser window to view it completely.)

07:43:38.774044 IP localhost.4911 > localhost.11211: S 4021632256:4021632256(0) win 32767 
07:43:38.774087 IP localhost.11211 > localhost.4911: S 4017670477:4017670477(0) ack 4021632257 win 32767 
07:43:38.774115 IP localhost.4911 > localhost.11211: . ack 1 win 8192 
07:43:38.774227 IP localhost.4911 > localhost.11211: P 1:29(28) ack 1 win 8192                get get.opensuse_en:user:id:32
07:43:38.774251 IP localhost.11211 > localhost.4911: . ack 29 win 8192 
07:43:38.774320 IP localhost.11211 > localhost.4911: P 1:941(940) ack 29 win 8192             VALUE ... END
07:43:38.774362 IP localhost.4911 > localhost.11211: . ack 941 win 8662 
07:43:38.775164 IP localhost.4911 > localhost.11211: P 29:44(15) ack 941 win 8662             get lag_times
07:43:38.775211 IP localhost.11211 > localhost.4911: P 941:946(5) ack 44 win 8192             END
07:43:38.777246 IP localhost.4911 > localhost.11211: P 44:111(67) ack 946 win 8662            set lag_times
07:43:38.777311 IP localhost.11211 > localhost.4911: P 946:954(8) ack 111 win 8192            STORED
07:43:38.814846 IP localhost.4911 > localhost.11211: . ack 954 win 8662 
07:43:38.829412 IP localhost.4911 > localhost.11211: P 111:1046(935) ack 954 win 8662         set set.opensuse_en:user:id:32
07:43:38.829468 IP localhost.11211 > localhost.4911: P 954:962(8) ack 1046 win 8660           STORED
07:43:38.829487 IP localhost.4911 > localhost.11211: . ack 962 win 8662 
07:43:38.845369 IP localhost.4911 > localhost.11211: P 1046:1104(58) ack 962 win 8662         get get.opensuse_en:pcache:idhash:7452-0!1!0!ISO.8601!0!en!0
07:43:38.845421 IP localhost.11211 > localhost.4911: P 962:967(5) ack 1104 win 8660           END
07:43:38.845520 IP localhost.4911 > localhost.11211: P 1104:1149(45) ack 967 win 8662         incr opensuse_en:stats:pcache_miss_absent 1
07:43:38.845572 IP localhost.11211 > localhost.4911: P 967:972(5) ack 1149 win 8660           622
07:43:38.846039 IP localhost.4911 > localhost.11211: P 1149:1175(26) ack 972 win 8662         get get.opensuse_en:messages
07:43:38.846087 IP localhost.11211 > localhost.4911: P 972:2372(1400) ack 1175 win 8660       VALUE
07:43:38.846138 IP localhost.11211 > localhost.4911: . 2372:18756(16384) ack 1175 win 8660 
07:43:38.846150 IP localhost.4911 > localhost.11211: . ack 18756 win 12332 
07:43:38.846183 IP localhost.11211 > localhost.4911: P 18756:34486(15730) ack 1175 win 8660   VALUE (cntd.)             *** no END yet! ***
07:43:38.861524 IP localhost.4911 > localhost.11211: P 1175:2431(1256) ack 34486 win 12332    set opensuse_en:pcache:idhash:7452-0!1!0!ISO.8601!0!en!0 1 86400 1183
07:43:38.898856 IP localhost.11211 > localhost.4911: . ack 2431 win 8660 
07:43:39.863552 IP localhost.4911 > localhost.11211: P 2431:2479(48) ack 34486 win 12332      get opensuse_en:watchlist:user:32:page:2:Poeml
07:43:39.863576 IP localhost.11211 > localhost.4911: . ack 2479 win 8660 
07:43:40.867489 IP localhost.4911 > localhost.11211: P 2479:2536(57) ack 34486 win 12332      set opensuse_en:watchlist:user:32:page:2:Poeml
07:43:40.867508 IP localhost.11211 > localhost.4911: . ack 2536 win 8660 
07:43:41.908938 IP localhost.4911 > localhost.11211: P 2536:2579(43) ack 34486 win 12332      get opensuse_en:messages:Project_Overview
07:43:41.908966 IP localhost.11211 > localhost.4911: . ack 2579 win 8660 
07:43:42.908826 IP localhost.4911 > localhost.11211: P 2579:2622(43) ack 34486 win 12332      get opensuse_en:messages:Project.Overview
07:43:42.908854 IP localhost.11211 > localhost.4911: . ack 2622 win 8660 
07:43:43.915679 IP localhost.4911 > localhost.11211: P 2622:2667(45) ack 34486 win 12332      get opensuse_en:messages:How_to_Participate
07:43:43.915706 IP localhost.11211 > localhost.4911: . ack 2667 win 8660 
07:43:44.927676 IP localhost.4911 > localhost.11211: P 2667:2712(45) ack 34486 win 12332      get opensuse_en:messages:How.to.Participate
07:43:44.927703 IP localhost.11211 > localhost.4911: . ack 2712 win 8660 
07:43:45.934733 IP localhost.4911 > localhost.11211: P 2712:2752(40) ack 34486 win 12332      get opensuse_en:messages:Documentation
07:43:45.934759 IP localhost.11211 > localhost.4911: . ack 2752 win 8660 
07:43:46.942543 IP localhost.4911 > localhost.11211: P 2752:2792(40) ack 34486 win 12332      get opensuse_en:messages:Documentation
07:43:46.942568 IP localhost.11211 > localhost.4911: . ack 2792 win 8660 
07:43:47.945386 IP localhost.4911 > localhost.11211: P 2792:2830(38) ack 34486 win 12332      get opensuse_en:messages:Communicate

Observing at the timestamps, we see a protocol desynchronization happening which makes the rest of the communication stuck (only failed gets follow).

Where does it happen? I checked the (memcache value) length that PHP gives, and it is correct, and reflects the actual data length that it sends. When it claimed (somewhere else) to send 1183 bytes, it actually sent 1183 bytes. So, the

> get get.opensuse_en:messages
> VALUE
> VALUE (cntd.)             *** no END yet! ***

part may actually be a false alarm, which could happen if the snap
length of the tcp dump was too small. (It was 8192.)

Indeed, when I did another dump with maximal snaplength (64k), it showed the “get” reading the complete VALUE, including the trailing “END” line that memache daemon appends.

Thus, the subsequent command could be at fault. And indeed it is the culprit. Look at the trace:

09:21:04.921611 IP localhost.4720 > localhost.11211: P 1176:2432(1256) ack 34487 win 12332 
        0x0000:  4500 051c 77e6 4000 4006 bff3 7f00 0001  E...w.@.@.......
        0x0010:  7f00 0001 1270 2bcb 5f1c dd2e 5fd5 8ada  .....p+._..._...
        0x0020:  8018 302c 061c 0000 0101 080a 172e af66  ..0,...........f
        0x0030:  172e af61 7365 7420 6f70 656e 7375 7365  ...aset.opensuse
        0x0040:  5f65 6e3a 7063 6163 6865 3a69 6468 6173  _en:pcache:idhas
        0x0050:  683a 3734 3532 2d30 2131 2130 2149 534f  h:7452-0!1!0!ISO
        0x0060:  2038 3630 3121 3021 656e 2130 2031 2038  .8601!0!en!0.1.8
        0x0070:  3634 3030 2031 3138 330d 0a4f 3a31 323a  6400.1183..O:12:
        0x0080:  2250 6172 7365 724f 7574 7075 7422 3a37  "ParserOutput":7
        0x0090:  3a7b 733a 353a 226d 5465 7874 223b 733a  :{s:5:"mText";s:
        0x00a0:  3936 313a 223c 703e 3c62 7220 2f3e 0a3c  961:ou.see.me.wo
        0x0110:  726b 696e 6720 6865 7265 0a3c 2f70 3e3c  rking.here..Hell
        0x0130:  3c2f 703e 3c70 3e49 2073 7461 7274 6564  ./p..p.I.started
        0x0140:  2074 6f20 776f 726b 2066 6f72 2053 7553  .to.work.for.SuS
        0x0150:  4520 696e 2032 3030 302e 2046 6f72 2073  E.in.2000..For.s
        0x0160:  6f6d 6520 7965 6172 732c 2049 206d 6169  ome.years,.I.mai
        0x0170:  6e74 6169 6e65 6420 4170 6163 6865 2c20  ntained.Apache,.
        0x0180:  4f70 656e 5353 4c2c 2044 4843 5020 616e  OpenSSL,.DHCP.an
        0x0190:  6420 6f74 6865 7220 7061 636b 6167 6573  d.other.packages
        0x01a0:  2e0a 3c2f 703e 3c70 3e54 6f64 6179 2c20  .../p..p.Today,.
        0x01b0:  4920 6d61 696e 7461 696e 2074 6865 206f  I.maintain.the.o
        0x01c0:  7065 6e53 5553 4520 3c61 2068 7265 663d  penSUSE..a.dow
        0x0210:  6e6c 6f61 6420 696e 6672 6173 7472 7563  nload.infrastruc
        0x0220:  7475 7265 3c2f 613e 2061 6e64 203c 6120  ture./a..and..a.
        0x0270:  6f72 696e 6720 6672 616d 6577 6f72 6b3c  oring.framework.
        0x0290:  6472 6573 7320 666f 7220 7468 6179 2069  dress.for.thay.i
        0x02a0:  7320 6674 7061 646d 696e 2061 7420 7375  s.ftpadmin.at.su
        0x02b0:  7365 2e64 652e 2920 4375 7272 656e 746c  se.de.).Currentl
        0x02c0:  7920 4920 776f 726b 206f 6e20 6d65 7461  y.I.work.on.meta
        0x02d0:  6c69 6e6b 2073 7570 706f 7274 2e0a 3c2f  link.support...p
        0x02f0:  206f 6e20 7468 6520 3c61 2068 7265 663d  .on.the..a.Build
        0x0330:  7669 6365 3c2f 613e 2e20 4920 7374 6172  vice./a...I.star
        0x0340:  7465 6420 746f 2077 7269 7465 203c 6120  ted.to.write..a.
        0x0390:  6865 2065 6172 6c69 6572 2064 6179 732e  he.earlier.days.
        0x03a0:  0a3c 2f70 3e3c 703e 596f 7527 6c6c 2066  ../p..p.You'll.f
        0x03b0:  696e 6420 6d65 206f 6e20 6672 6565 6e6f  ind.me.on.freeno
        0x03c0:  6465 2061 7320 226d 6972 726f 7262 7261  de.as."mirrorbra
        0x03d0:  696e 2220 2870 7265 7669 6f75 736c 793a  in".(previously:
        0x03e0:  2044 7544 4529 0a3c 2f70 3e0a 3c21 2d2d  .DuDE)../p...";s
        0x0470:  616e 6775 6167 654c 696e 6b73 223b 613a  anguageLinks";a:
        0x0480:  303a 7b7d 733a 3134 3a22 6d43 6174 6567  0:{}s:14:"mCateg
        0x0490:  6f72 794c 696e 6b73 223b 613a 303a 7b7d  oryLinks";a:0:{}
        0x04a0:  733a 3137 3a22 6d43 6f6e 7461 696e 734f  s:17:"mContainsO
        0x04b0:  6c64 4d61 6769 6322 3b62 3a30 3b73 3a31  ldMagic";b:0;s:1
        0x04c0:  303a 226d 4361 6368 6554 696d 6522 3b73  0:"mCacheTime";s
        0x04d0:  3a31 343a 2232 3030 3831 3231 3431 3632  :14:"20081214162
        0x04e0:  3130 3422 3b73 3a38 3a22 6d56 6572 7369  104";s:8:"mVersi
        0x04f0:  6f6e 223b 733a 353a 2231 2e35 2e30 223b  on";s:5:"1.5.0";
        0x0500:  733a 3130 3a22 6d54 6974 6c65 5465 7874  s:10:"mTitleText
        0x0510:  223b 733a 303a 2222 3b7d 0d0a            ";s:0:"";}..
09:21:04.962876 IP localhost.11211 > localhost.4720: . ack 2432 win 8660 
        0x0000:  4500 0034 ab75 4000 4006 914c 7f00 0001  E..4.u@.@..L....
        0x0010:  7f00 0001 2bcb 1270 5fd5 8ada 5f1c e216  ....+..p_..._...
        0x0020:  8010 21d4 5f95 0000 0101 080a 172e af70  ..!._..........p
        0x0030:  172e af66                                ...f
09:21:05.930092 IP localhost.4720 > localhost.11211: P 2432:2480(48) ack 34487 win 12332 
        0x0000:  4500 0064 77e7 4000 4006 c4aa 7f00 0001  E..dw.@.@.......
        0x0010:  7f00 0001 1270 2bcb 5f1c e216 5fd5 8ada  .....p+._..._...
        0x0020:  8018 302c 3bd2 0000 0101 080a 172e b062  ..0,;..........b
        0x0030:  172e af70 6765 7420 6f70 656e 7375 7365  ...pget.opensuse
        0x0040:  5f65 6e3a 7761 7463 686c 6973 743a 7573  _en:watchlist:us
        0x0050:  6572 3a33 323a 7061 6765 3a32 3a50 6f65  er:32:page:2:Poe
        0x0060:  6d6c 0d0a                                ml..

Looks normal, except that the memcached reply is an empty TCP ACK, without the “STORED” confirmation. And we see the next command already being issued, which is the first in the series of those that are never replied to.

So what’s wrong?

The command that PHP sends is broken. By the protocol specs, the key names must not contain spaces. The syntax is:

set key flags ttl length

But as the above trace shows, the command _does_ contain a space:

set opensuse_en:pcache:idhash:7452-0!1!0!ISO 8601!0!en!0 1 86400 1183\r\n
                                            ^

This turns the command, that intends to write 1183 bytes of data, into one that indicates to memcached it’ll write 86400 bytes of data.

Of course, these will never be sent, and memcached will hang forever, eating all further commands up as perceived data, and never reply in this connection.

Thus, the wiki uses keys that are not suitable, while the PHP memcache implementation expects them to be valid.

09:21:04.921611 IP localhost.4720 > localhost.11211: P 1176:2432(1256) ack 34487 win 12332 
        0x0000:  4500 051c 77e6 4000 4006 bff3 7f00 0001  E...w.@.@.......
        0x0010:  7f00 0001 1270 2bcb 5f1c dd2e 5fd5 8ada  .....p+._..._...
        0x0020:  8018 302c 061c 0000 0101 080a 172e af66  ..0,...........f
        0x0030:  172e af61 7365 7420 6f70 656e 7375 7365  ...aset.opensuse         # set opensuse_en:pcache:idhash:7452-0!1!0!ISO 8601!0!en!0 1 86400 1183
        0x0040:  5f65 6e3a 7063 6163 6865 3a69 6468 6173  _en:pcache:idhas                                                               ^
        0x0050:  683a 3734 3532 2d30 2131 2130 2149 534f  h:7452-0!1!0!ISO
        0x0060:  2038 3630 3121 3021 656e 2130 2031 2038  .8601!0!en!0.1.8
        0x0070:  3634 3030 2031 3138 330d 0a4f 3a31 323a  6400.1183..O:12:         # claims it'll write 1183 bytes. And it does so
        0x0080:  2250 6172 7365 724f 7574 7075 7422 3a37  "ParserOutput":7
        0x0090:  3a7b 733a 353a 226d 5465 7874 223b 733a  :{s:5:"mText";s:
        0x00a0:  3936 313a 223c 703e 3c62 7220 2f3e 0a3c  961:"t;img.src="

So, I fixed the PHP memcache client to sanitize all memcache keys before usage, replacing spaces with underscores.
The fix looks like this:

function sanitize_key($key) {
    /* make sure there are no spaces in the key
       otherwise, it'll lead to a protocol desynchronization, with beautiful hangs
       poeml, Sun Dec 14 10:24:12 MST 2008 */
    $key = str_replace(" ", "_", $key);
    return $key;
}

/* and then in all places where keys are used: */
    $key = $this->sanitize_key($key);

Indeed, with the sanitized memcache keys, the problem is immediately fixed!

To verify the effectiveness of the patch, I run tcpdump in the background, with ‘grep -A3 opensuse’ attached, to basically limit the output to show the key values:

Before :
        0x0030:  1748 336d 6765 7420 6f70 656e 7375 7365  .H3mget.opensuse
        0x0040:  5f65 6e3a 7063 6163 6865 3a69 6468 6173  _en:pcache:idhas
        0x0050:  683a 3734 3532 2d30 2131 2130 2149 534f  h:7452-0!1!0!ISO
        0x0060:  2038 3630 3121 3021 656e 2130 0d0a       .8601!0!en!0..
--
After:
        0x0030:  1748 62d3 6765 7420 6f70 656e 7375 7365  .Hb.get.opensuse
        0x0040:  5f65 6e3a 7063 6163 6865 3a69 6468 6173  _en:pcache:idhas
        0x0050:  683a 3734 3532 2d30 2131 2130 2149 534f  h:7452-0!1!0!ISO
        0x0060:  5f38 3630 3121 3021 656e 2130 0d0a       _8601!0!en!0..

A simple bug, in the end. Luckily, with a trivial fix. But quite a long way to go to find it.
And I hope this writeup is useful.

Contrib Repository Kicked Off

December 11th, 2008 by

During this week the so-called Contrib repository has been finally really started after the discussion come to a halt some time ago.  Contrib (see here for details) is a repository managed by community members that will be frozen at release time to make more packages available for openSUSE users.  It is a kind of “Universal” repository for third-party packages.

To see how it works, I updated today Greg KH‘s smugbatch package that I put into my home project a couple of months ago and submitted it to contrib with the following command:

osc submitreq create home:a_jaeger smugbatch openSUSE:Factory:Contrib smugbatch -m \
"Add smugbatch to CONTRIB"

This produces an email to the opensuse-contrib mailing list and now my request gets reviewed by the maintainer team of Contrib.

Thanks a lot to Alexey for starting the contrib project a couple of months ago and to all volunteers that are driving it now!

Btw. smugbatch is a command line tool to manage photos on SmugMug – you can see some of my photos at my personal SmugMug page.

Getting GroupWise 8 client to work on openSUSE 11.1

December 10th, 2008 by

Since openSUSE 11.1 has gone gold and GroupWise 8 has been around for a while, I decided to install openSUSE 11.1 with KDE 4.1 on my laptop. I use GroupWise and when I tried to install the GroupWise rpm’s, it complained about missing dependancies. I had to install the following software in order for GroupWise 8 to work:

openmotif, openmotif22-libs and libstdc++33

Once that was installed the GroupWise rpm’s installed without any problems

Setting up a simple router

December 10th, 2008 by

Today I spent the best part of the morning to configure a SLES 10 SP2 server as a simple router. I googled quite a lot and could not find a nice and concise post on how to configure SLES 10 as a simple router, so I thought I’ll write up my experiences in the hopes that it will help someone. Although this was done on SLES 10 I am pretty sure it will apply to openSUSE as well.

My setup is as follows: Local lan is 192.168.0.0/24 and I want a private lan on 192.168.1.0/24 so that I can run stuff like DHCP and PXE without upsetting everyone on the network. I have a SLES 10 server with 2 network cards: eth1 connected to the local lan and eth0 connected to my private switch. The local lan is connected via ADSL to the Internet and I need my private network to also be able to connect to the Internet.

(more…)

Testing “Scratch” – the easy programming language

December 7th, 2008 by

Some german teacher point me to “Scratch” a few weeks ago. This week, I tried to create a package for openSUSE.

The good news: Ubuntu people already tried to package scratch for debian – but they use a perl script, which installs scratch for each user – again, and again, and again….

Well – as Scratch runs via “Squeak”, which is available for openSUSE since a long time now, it took me just a few minutes to create a simple wrapper script calling squeak and loading the scratch image. The only things left to do:

  • unpack the scratch image (I currently use the ubuntu file for this – otherwise I have to unzip the exe file, let’s look at this for the next version)
  • add a desktop icon
  • add a desktop entry
  • place everything in the filesystem

ready.

As result, every user on a system with the scratch RPM can now run scratch without any further actions. Sometimes packaging can be so easy 😉

But, yes: that’s only packaging a binary into a RPM – normally, we want to compile the source so we have full control over the binary. If someone has a problem with this package, I might not be able to help, because I haven’t seen the source and therefore have no chance to help. But perhaps, with requests from Ubuntu and openSUSE users, the developers of scratch open their sourcecode to the public.

So: have fun, playing around with Scratch

GMC LiveCDs Seek Testers

December 5th, 2008 by

Coolo just announced on the opensuse-factory list:

Strictly speaking it’s not Factory, but I put the GMC live cds at the usual place: http://ftp.suse.com/pub/projects/FactoryLiveCDs/

It would be good to give the 11.1 kernel some testing.

GMC stands for Goldmaster Candidate – which means that these LiveCDs might be declared next week as goldmaster – and goldmaster is another name for the final release.

Note that we currently have no full repository for openSUSE 11.1 out – the one available contains RC1 and factory is not building at all, the sources are updated but not the binaries.  We use for building the distribution the 11.1 distribution in the openSUSE Build Service.

openSUSE 11.1 Goes RC2

December 3rd, 2008 by

Yesterday Coolo released openSUSE 11.1 RC2 internally to fix all existing bugs marked as shipstopper.  We use now the SHIP_STOPPER flag in bugzilla to mark bugs as blocking the release (see our wiki for the usage of SHIP_STOPPER) and not anymore BLOCKER (which blocks testing or development).  If you want to search in bugzilla for the currently open SHIP_STOPPERS, you can use this query.

We also found one really annoying bug in our configuration that lead to the following output:

$ rpm -q –queryformat ‘%{vendor}\n’ glibc
openSUSEopenSUSE:11.1 openSUSE:11.1

Therefore, the openSUSE Build Service is currently rebuilding all 11.1 rpm packages with a corrected vendor tag.  Also a couple of packages where checked in to fix bugs so that we soon get RC3.

If RC3 passes all tests, we hope to release it publicly as Goldmaster – the release date is still the 18th of December.

Due to the time it takes to release a build publicly, we are not releasing any RCs besides RC1 to the public.

I’m running RC2 on my machines now and I’m quite happy with this release.