
Anyone understand Linux routing source code?


Chris_B


I'm trying (and failing) to understand how the route cache hashing in 2.4.9's route.c works. I think I have a hash bucket overflow happening on a box, but can't prove it unless I can figure out the hashing code and run a few addresses manually to work out which hash bucket they'd occupy.

My brain hurts.


[ QUOTE ]

Chris, if you like you can send me the code (my email address is in my profile) and I'll take a look.

Am out quite a bit today, but I'll certainly see if I can help.

Trev

[/ QUOTE ]

Thanks Trev. It's not really urgent (somebody else has swapped the box out for a BSD-based one), so it can go on here to edumacate (or baffle) the rest of the IT guys.

The code in 2.4.9 appears to be the same as the code here.

<font class="small">Code:</font><hr /><pre>207 static __inline__ unsigned rt_hash_code(u32 daddr, u32 saddr, u8 tos)

208 {

209 unsigned hash = ((daddr & 0xF0F0F0F0) >> 4) |

210 ((daddr & 0x0F0F0F0F) << 4);

211 hash ^= saddr ^ tos;

212 hash ^= (hash >> 16);

213 return (hash ^ (hash >> 8)) & rt_hash_mask;

214 }</pre><hr />

As far as I understand it, the hash is calculated this way:

daddr AND 11110000111100001111000011110000 (0xF0F0F0F0), bit-shifted 4 to the right,

OR

daddr AND 00001111000011110000111100001111 (0x0F0F0F0F), bit-shifted 4 to the left (i.e. the two nibbles of each byte of daddr are swapped);

the resulting hash is XOR'd with (saddr XOR tos);

the resulting hash is XOR'd with (hash bit-shifted 16 bits to the right);

the resulting hash is XOR'd with (hash bit-shifted 8 bits to the right);

the resulting hash is AND'd with rt_hash_mask and returned.
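
For anyone who wants to run a few addresses through this by hand, here's a minimal userspace sketch of the function above. The mask value is an assumption (it really comes from the table-sizing code further down), tos is taken as 0, and the addresses are fed in network byte order, since that's how the kernel sees them:

<font class="small">Code:</font><hr /><pre>/* Userspace sketch of rt_hash_code() from 2.4 route.c, for poking a few
 * addresses through by hand.  RT_HASH_MASK is a guess; the real value is
 * computed at boot by the table-sizing code quoted below. */
#include <stdio.h>
#include <arpa/inet.h>

typedef unsigned int  u32;
typedef unsigned char u8;

#define RT_HASH_MASK 4095U      /* assumption: 4096 buckets */

static unsigned rt_hash_code(u32 daddr, u32 saddr, u8 tos)
{
        unsigned hash = ((daddr & 0xF0F0F0F0) >> 4) |
                        ((daddr & 0x0F0F0F0F) << 4);
        hash ^= saddr ^ tos;
        hash ^= (hash >> 16);
        return (hash ^ (hash >> 8)) & RT_HASH_MASK;
}

int main(void)
{
        /* one source and two destinations, just as an example */
        u32 saddr = inet_addr("10.254.4.107");  /* network byte order */
        u32 d1    = inet_addr("10.1.33.15");
        u32 d2    = inet_addr("10.2.33.15");

        printf("bucket for 10.1.33.15: %u\n", rt_hash_code(d1, saddr, 0));
        printf("bucket for 10.2.33.15: %u\n", rt_hash_code(d2, saddr, 0));
        return 0;
}</pre><hr />

Compile it with gcc and feed in a handful of real source/destination pairs to see which buckets they land in.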

What I don't understand is where the rt_hash_mask comes from. I suspect it's from the code that sizes the hash table based on available memory, see here.

<font class="small">Code:</font><hr /><pre>2487 do {

2488 rt_hash_mask = (1UL << order) * PAGE_SIZE /

2489 sizeof(struct rt_hash_bucket);

2490 while (rt_hash_mask & (rt_hash_mask - 1))

2491 rt_hash_mask--;

2492 rt_hash_table = (struct rt_hash_bucket *)

2493 __get_free_pages(GFP_ATOMIC, order);

2494 } while (rt_hash_table == NULL && --order > 0);</pre><hr />
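
To put some rough numbers on that, here's a sketch of the sizing arithmetic. The assumptions: the "goal" computed earlier in ip_rt_init() (not shown above) works out at roughly one page of hash table per 64Mb of RAM, pages are 4Kbytes, and struct rt_hash_bucket is 8 bytes (a chain pointer plus a lock - the real size depends on the kernel config). As far as I can tell, the kernel knocks rt_hash_mask down by one shortly after this loop so that it can actually be used as a mask in rt_hash_code():

<font class="small">Code:</font><hr /><pre>/* Rough estimate of the route cache hash table size for a given amount
 * of RAM.  Assumptions: one 4K page of hash table per 64Mb of RAM and an
 * 8-byte struct rt_hash_bucket; treat the output as an estimate only. */
#include <stdio.h>

#define PAGE_BYTES   4096UL
#define BUCKET_BYTES 8UL        /* assumed sizeof(struct rt_hash_bucket) */

int main(void)
{
        unsigned long ram_mb = 512;             /* the 512Mb box described further down */
        unsigned long goal   = ram_mb / 64;     /* pages wanted for the table */
        unsigned long order  = 0;
        unsigned long rt_hash_mask;

        while ((1UL << order) < goal)
                order++;                        /* smallest order covering the goal */

        rt_hash_mask = (1UL << order) * PAGE_BYTES / BUCKET_BYTES;
        while (rt_hash_mask & (rt_hash_mask - 1))
                rt_hash_mask--;                 /* round down to a power of two */

        /* 512Mb -> 8 pages -> 4096 buckets -> mask 4095, i.e. the hash is
         * effectively truncated to 12 bits on a box this size */
        printf("%lu buckets, rt_hash_mask ends up as %lu\n",
               rt_hash_mask, rt_hash_mask - 1);
        return 0;
}</pre><hr />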

I understand that the hashing code was changed around 2.4.20 because the prior code could suffer from excess hash collisions, which is exactly what I'm seeing - terrible performance, "dst cache overflow" logged repeatedly and a loss of network connectivity. Unfortunately, I can't easily upgrade the kernel as it's a turnkey system with a custom-compiled kernel and I don't know what changes the software vendor has made.

If I can figure out a short-term workaround (I'm thinking of forcing garbage collection more often by reducing the table size and decreasing the gc timers and thresholds), then I'll pester the vendor to move to a more modern kernel once I can prove that the particular traffic profile the box handles is causing the problem (it's handling a lot of 10.x.x.x traffic with the same first three source octets and the same first and last destination octets).

Thanks for any advice or info.


As far as I can see, it isn't possible to tune the routing cache table size.

As you probably saw, the size is a function of the amount of memory in the box, so one option may be to add more RAM. As far as I can make out, for every 64Mb of RAM, the route cache gets another page (4Kbytes).

I think you should be able to tinker with the GC settings - it may well help to modify ip_rt_gc_interval down from its default value (6000ms) to something nearer the minimum (which is 500ms).

Sorry, I don't know how to set kernel values on a Linux system as yet, but I'm sure there's an easy enough way.
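
For what it's worth, these tunables normally show up as files under /proc/sys/net/ipv4/route/, so setting one is just a matter of writing a number into the right file (echo as root, or a few lines of C). A sketch, assuming ip_rt_gc_interval is exposed there as gc_interval - read what the file currently holds first, since I'm not certain what units this particular kernel uses:

<font class="small">Code:</font><hr /><pre>/* Minimal sketch: read and then lower the route cache GC interval via
 * procfs.  Assumes ip_rt_gc_interval appears as
 * /proc/sys/net/ipv4/route/gc_interval; needs root to write. */
#include <stdio.h>

int main(void)
{
        const char *path = "/proc/sys/net/ipv4/route/gc_interval";
        FILE *f;
        int current;

        f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        if (fscanf(f, "%d", &current) == 1)
                printf("current gc_interval: %d\n", current);
        fclose(f);

        f = fopen(path, "w");
        if (!f) { perror(path); return 1; }
        fprintf(f, "%d\n", 1);  /* "1" is only a placeholder value */
        fclose(f);
        return 0;
}</pre><hr />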

HTH, Trev


Thanks Trev, very useful.

So if the box is seeing traffic that causes lots of identical hash values, the linked list in the hash bucket is going to get very expensive to traverse for each network packet. From what you've seen, do you think lots of traffic from a few 10.x.x.x addresses to lots of 10.x.x.15 addresses could cause this? My very basic manual calculations suggest it's possible to derive a lot of duplicate hashes with that traffic profile and this code.

My concern is that gc is an expensive process, and if I do it too often, I'm going to kill the performance if not the box just as surely as if I leave it as it is.

I can hardly believe this is a commercial product from a major player in the market segment, and it looks like it's possible to DoS it with a few hundred kb/s of legitimate traffic in the right environment.


Yep, if there are lots of hash collisions, then the cache lookup functions are going to have to do more work traversing the linked list of route entries.

And you're right, GC would be quite expensive to do, so what you gain from clearing the cache, you would probably lose from doing GC more often.

If hash collisions are the problem, then adding more memory isn't going to help, as you are still pretty likely (although not certain) to see a high collision rate.

Roughly how many different source and destination addresses are there?

I've started a little test on my machine to see how efficient the hash algorithm is with source and dest addresses in the range 10.x.x.x. Unfortunately, it has to squirt through 16million*16million iterations, which I don't think will complete in my lifetime, so it'd be handy to be able to narrow it down a bit.

It seems odd that it would be so inefficient as to cause network probs though. What sort of box is it running on and how much RAM does it have in it?

I'll try to have a further look next week.


You are a superstar.

Most of the traffic comes from one of:

10.254.6.101-121

10.254.4.107

10.254.4.132-136

Destinations are 90% 10.x.x.15, with the second octet evenly distributed from 1 to 80 and the 3rd octet one of 33, 65, 129, 161, 193 or 225.

As you can see, a rather limited set of destinations. Basically, the box is tunneling traffic to all these destinations which are on remote sites with 32 networks of 256 addresses each, hence the fixed list of 6 possible 3rd octet values.

The box is a Compaq ML310 with P4 2.8Ghz, 512Mb RAM and U160 SCSI disk.


Hey Chris,

I ran a little test today with that information.

According to my calculations, there are 4095 buckets in the hash table, and all of the combinations of source and destination address are spread evenly throughout the table. I didn't notice any hash buckets with more than 3 entries in them, although I didn't examine too closely.

So, it looks like the hash algorithm gives a fairly reasonable distribution given the small number of combinations of source/dest addresses (25 * 240).

I can't see this giving the box too much hassle given that the hash algorithm is (computationally) very straightforward, so it looks like the problem may be elsewhere.
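
For anyone wanting to reproduce that, the test boils down to hashing every source/destination pair and counting how full each bucket gets. Here's a sketch, assuming tos 0 and a 4095 mask, with the address ranges taken straight from Chris's post (so the totals may differ slightly from the 25 * 240 combinations mentioned above):

<font class="small">Code:</font><hr /><pre>/* Sketch of the distribution test: hash every source/destination pair
 * from the traffic profile above and count bucket occupancy.  Assumes
 * tos 0 and 4096 buckets (mask 4095). */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

typedef unsigned int  u32;
typedef unsigned char u8;

#define RT_HASH_MASK 4095U

static unsigned rt_hash_code(u32 daddr, u32 saddr, u8 tos)
{
        unsigned hash = ((daddr & 0xF0F0F0F0) >> 4) |
                        ((daddr & 0x0F0F0F0F) << 4);
        hash ^= saddr ^ tos;
        hash ^= (hash >> 16);
        return (hash ^ (hash >> 8)) & RT_HASH_MASK;
}

/* build an address in network byte order, as the kernel sees it */
static u32 ip(int a, int b, int c, int d)
{
        return htonl(((u32)a << 24) | ((u32)b << 16) | ((u32)c << 8) | (u32)d);
}

int main(void)
{
        static const int third[] = { 33, 65, 129, 161, 193, 225 };
        unsigned counts[RT_HASH_MASK + 1];
        u32 sources[32];
        int nsrc = 0, i, b, c, used = 0;
        unsigned max = 0;

        memset(counts, 0, sizeof(counts));

        /* sources: 10.254.6.101-121, 10.254.4.107, 10.254.4.132-136 */
        for (i = 101; i <= 121; i++)
                sources[nsrc++] = ip(10, 254, 6, i);
        sources[nsrc++] = ip(10, 254, 4, 107);
        for (i = 132; i <= 136; i++)
                sources[nsrc++] = ip(10, 254, 4, i);

        /* destinations: 10.<1-80>.<33|65|129|161|193|225>.15 */
        for (b = 1; b <= 80; b++)
                for (c = 0; c < 6; c++)
                        for (i = 0; i < nsrc; i++)
                                counts[rt_hash_code(ip(10, b, third[c], 15),
                                                    sources[i], 0)]++;

        for (i = 0; i <= (int)RT_HASH_MASK; i++) {
                if (counts[i]) used++;
                if (counts[i] > max) max = counts[i];
        }
        printf("%d pairs hashed, %d buckets used, worst bucket holds %u\n",
               nsrc * 80 * 6, used, max);
        return 0;
}</pre><hr />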

I think it would probably be prudent to identify the location of the "dst cache overflow" message and work back from there.


Thanks, that's not what I expected at all!

The "dst cache overflow" message seems to get logged when GC fails to reduce the table below the max desired size within it's configured timeout period.

I can't find any evidence to support a genuine traffic overload; hashing problems were kind of my last stab at working out what was going on. Still, this is a modified kernel based on RedHat 7.2/7.3; although I didn't think the modifications were that major, there may have been some changes to the networking code.

Ah well, back to the drawing board. Thanks for your help in figuring out the code though. You are a gentleman & a scholar, sir.


Chris, I think you're on the right track with a traffic overload.

It looks like a new routing entry is created each time a destination is requested - this has nothing to do (per se) with the route cache itself, but seems to be just a list of destinations (held in ipv4_dst_ops). When the number of entries held in this list exceeds 65536 (for your machine), the "dst cache overflow" message is printed and there is some loss of network connectivity.

Since the size of the list is a function of the number of hash buckets in the route cache (hash buckets * 16), by adding more memory to the box (say another 512Mb), you can increase the cache size and therefore the maximum number of entries in the list. Double memory = double number of entries!

I don't know whether or not this will work on your machine, but check the file:

/proc/sys/net/ipv4/route/max_size - I understand that this is the maximum size of the routing cache (although this may only be on newer versions). Simply increase the value in that file (double would be a fair start); the change takes effect straight away, but it won't survive a reboot, so it would need reapplying at boot time.
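
As a rough sanity check on those figures, under the assumptions already discussed (one 4K hash page per 64Mb of RAM, 8-byte buckets, and 16 entries allowed per bucket) the arithmetic comes out like this - treat it as an estimate, since the real constants depend on the kernel build, and the authoritative number is whatever max_size currently reads:

<font class="small">Code:</font><hr /><pre>/* Back-of-envelope estimate of the dst entry limit for a given amount of
 * RAM: one 4K page of hash table per 64Mb, 8-byte buckets, and a cap of
 * 16 entries per bucket (all assumptions discussed above). */
#include <stdio.h>

int main(void)
{
        unsigned long ram_mb[] = { 512, 1024 };
        int i;

        for (i = 0; i < 2; i++) {
                unsigned long pages   = ram_mb[i] / 64;      /* hash-table pages */
                unsigned long buckets = pages * 4096 / 8;    /* rt_hash_bucket slots */
                unsigned long limit   = buckets * 16;        /* dst entry limit */
                printf("%4lu Mb RAM -> %5lu buckets -> max %6lu entries\n",
                       ram_mb[i], buckets, limit);
        }
        return 0;
}</pre><hr />

For the 512Mb box that gives the 65536 figure above; doubling the RAM doubles it to 131072.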

Trev


Cheers Trev! I've got one of the boxes "on order" from the remote site to get it into "Labs" here and have a play with it, see if I can break it.

I found this link about DoS attacks based on hash table complexity (or the lack thereof). Interesting stuff, especially as Linux uses lots of hashed cache tables for all kinds of things - I wonder how many vulnerabilities there are waiting to be discovered. I can see Linux going more commercial (RedHat going Enterprise-only etc) and the number of security vulnerabilities increasing as it becomes more popular. I wonder if people will start switching back to Windows?


I was discussing the commercial viability of Linux with some mates (who work in Solaris engineering for Sun) last week. They're obviously a little bigoted, but they see the main problem with commercial success being the way patches and releases are handled.

Specifically, if a (security) bug is found in Linux or Open Source software in general, it will most likely be fixed within hours and a patch released. However, as has regularly been shown to be the case, the patch can break something else that was dependent on that functionality.

Sun's customers moan about how long it takes to qualify a patch before release, but I think they would moan a lot more if the patch they applied brought down their production server.

I think we will see a steady increase in the number of viruses targeted at Linux as it becomes more widespread, and it wouldn't surprise me if we come across some rather large security holes in Linux in the coming years. That said, I doubt people will migrate back to Windows from Linux.

