PDA

View Full Version : G2 / Gnutella comparison



John W. Lindh
February 23rd, 2003, 04:40 PM
I already posted this on BearShare.net, I think I'll just post it again, because I'm really, really fond of it.

If you see any mistakes (besides "but 100,000 G2 users is just 3,000 hubs not 5,000" or "but shareaza hubs only connect to 5 hubs not 6"), feel free to comment on it.

Note that I did not compare the current gnutella network to Shareaza. At the moment it's really hard to create any calculations for it, because of all the dropped queries / flow control (I would have to take into account what algorithms are used etc.). However it can hardly be as efficient as the new design LimeWire and BearShare have developed.


100,000 users querying every 10 minutes is 600,000 queries per hour or 166.7 queries per second. That's 16.7k/s (packet size 100bytes) if every query would just consist of one packet being sent, those 16.7 k/s would be distributed among all hubs, - at 100,000 users that would be maybe 5,000. That means the bandwidth requirement would be 3.33 bytes per hub and second.
BUT a sharaza servent won't just query a single hub, it will query let's say 400 (= 1330 bytes per hub and second). That's just incoming bandwidth.

A hub will forward all the queries it receive to 6 connected hubs (or was it just 5?) - thanks to intra-ultrapeer QRP we may assume that 50% will not be forwarded - so we are at 4k/s outgoing bandwidth for the queries. But there are still the leafs. Okay, - let's assume each hub queried so far has to forward the query to one leaf (effectively doubling the outgoing bandwidth requirement), so we are at 8k/s?

8k/s should not be too much of a problem, shouldn't it? However we have still to route the replies. Okay, let's say each query returns about 100 queryhits, that's about 60kbytes. Further, I will assume that only half of those queryhits have to be routed by hubs because a leaf or the querier is firewalled, in total adding 1k/s for the hub.
That's 9k/s outgoing bandwidth requirement for all hubs - what's the deal? Okay, G2 uses UDP, that means the hubs have to inform the whole network they exist and the hubs also have to tell the querier which hubs he just searched. That's another 2k/s (each hub has to reply to each query with the addresses of the hubs it is connected to plus some additional addresses to tell the querier where to search next, altogether 150 bytes per query is not too much for it). Making it 11k/s. Assuming that the hubs have to do some querying for firewalled leafs - that would make it about 12k/s.

Let's compare how TCP performs. To query the same number of hosts (in the example above 400 hubs were queried directly - thats about 2200 hubs effectively plus some duplicate queries).
I will use a ttl of 3 at an outdegree of 16 (= number of ultrapeer connections per ultrapeer). In addition I will use intra-ultrapeer qrp. The calculation is a bit tricky but about 4000 query packets will have to be sent to reach the same number of leafs (only 2048 ultrapeers will be queried because 50% of the queries at last hop can be dropped, and each of them has a leaf that has to be queried, - just like in the G2 example). That makes it ~13k/s for every ultrapeer. In addition all the replies have to be routed. 60kbytes per query that have to be routed over an average of 3 hops = an additional 6k/s for every ultrapeer. That means the same task requires 19k/s when using the improved gnutella protocol.

So G2 rules? - Not exactly. Let's say you want to search a single file (with a known urn) in a large network - sort of global search like (that's the original intention behind G2). 2,000,000 users, 100,000 hubs - and you want to search 50% of them. You will have to query 11,000 hubs for that, - that's 1.1MBytes in outgoing query traffic for the querier and another 1.7Mbytes for the network (to inform you which hubs you searched and where to search next).
With TCP you could assume a network using ttl 4, outdegree 20 it's merely outgoing bandwidth 762kbytes for the whole network and 100bytes outgoing bandwidth for the querier. It's true that this number is only as low as it is because of the high outdegree. What if Shareaza hubs had 20 connections to other hubs? That would reduce the hubs the querier had to search himself to ~3,400 or 340kbytes outgoing bandwidth. The outgoing bandwidth requirement for the rest of the network, however will still be as high as before or only slightly lower: (The hubs have to tell the querier still about the addresses of the 50,000 hubs that he searched plus where to search next (also about 50,000 addresses). At 10bytes per address it won't be any lower than 1Mbyte for the network (it'll probably be 30%-50% higher because of some redundancies).

G2 is good for small networks and short queries, but when you want to search a really big network for really rare files, it's not as efficient as Gnutella can be.

method77
February 23rd, 2003, 04:54 PM
The only thing I know about G2 and Gnutella is that I don't get any results from any of them!

isus
February 23rd, 2003, 05:33 PM
Originally posted by method77
The only thing I know about G2 and Gnutella is that I don't get any results from any of them!

damn straight. :mellow

FrYGuY
February 24th, 2003, 05:47 AM
Here's a rather glaring one.

If a hub forwards it to 1 leaf, there's pretty much no chance it will have to forward 100 hits, plus, though I'm not sure, I believe the leaf might forward the hit directly the the querier, thus negating that need for the hub's bandwidth. (Before you say anything about 'popular files', it's been said many times before that query hits of popular files have a much higher chance of getting dropped, wheras a rare file's query-hit has an almost nil chance. Advanced flow-control, if you will)

Also, throw into account that all links within Gnutella 2 are compressed.

Here's another: the TTL for the high outdegree Gnutella structure will be 4, not 3... ("I will use a ttl of 3 at an outdegree of 16 (= number of ultrapeer connections per ultrapeer). ") Also don't forget that the high-outdegree peers will be put into a network where TTL = 7 queries are common, so much of the bandwidth will be eaten up by them.

There are probably more, but I've gotta run to Class...

John W. Lindh
February 24th, 2003, 06:21 AM
Originally posted by FrYGuY
Here's a rather glaring one.

If a hub forwards it to 1 leaf, there's pretty much no chance it will have to forward 100 hits, plus, though I'm not sure, I believe the leaf might forward the hit directly the the querier, thus negating that need for the hub's bandwidth.
I assumed that 100 hits would be the number of hits returned by the search in total, after querying 400 hubs * 3 leafs.
I took the possibility of the leaf returning the hit directly into consideration, but I assumed that in 50% of all cases either the queried leaf or the querier is firewalled (I read somewhere that actually about 40% of gnutella are firewalled and I thought this would be an optimistic assumption), so one hub would have to relay a hit.


(Before you say anything about 'popular files', it's been said many times before that query hits of popular files have a much higher chance of getting dropped, wheras a rare file's query-hit has an almost nil chance. Advanced flow-control, if you will)
Flow control is not possible in a network relying on UDP because the hubs can not keep count of the number of queries a single non-firewalled leaf is sending. If a bad client entered the network there would not be much that could keep him from querying many more hosts than necessary.


Also, throw into account that all links within Gnutella 2 are compressed.
I thought about it and I decided not to take it into consideration because there are gnutella clients (Swapper, gtk-gnutella but LimeWire is also using some form of compression) having compression as well and if both networks used compression the advantage for G2 would be zero. In addition I have no hard numbers on how well this compression behaves.


Here's another: the TTL for the high outdegree Gnutella structure will be 4, not 3... ("I will use a ttl of 3 at an outdegree of 16 (= number of ultrapeer connections per ultrapeer). ")
TTL 4 was chosen by the developers for a network with 20,000 ultrapeers, my example was for a network of 5,000 ultrapeers. The developers would certainly optimize TTL and outdegree for the current network size. It would be even possible to adjust TTL & outdegree dynamically (in each client) to the number of duplicate queries a client receives.


Also don't forget that the high-outdegree peers will be put into a network where TTL = 7 queries are common, so much of the bandwidth will be eaten up by them.
High outdegree peers will normalize the TTL of low-outdegree queries and drop queries with high TTL's. In addition they will prefer connecting to each other so we will have in fact a high outdegree network loosely connected to the low degree network and dropping most of the queries from the low degree network. - At least that's what LimeWire's current implementation suggests.


There are probably more, but I've gotta run to Class...
Keep 'em coming.

FrYGuY
February 24th, 2003, 08:42 AM
Originally posted by John W. Lindh
Flow control is not possible in a network relying on UDP because the hubs can not keep count of the number of queries a single non-firewalled leaf is sending. If a bad client entered the network there would not be much that could keep him from querying many more hosts than necessary.

Not exactly true. If the Hub does forward queries for the leaves in the case of firewalled clients, if there are a good number of hits from the collective leafs, then odds are a good number of them will be dropped, because it's a popular file.

If, on the other hand, a query comes in, and of the 200 or so leaves, one responce is generated, the hub will be able to figure out that from that query, very few results were generated, and will place it higher in the stack in the outgoing packets, though there is still the possibility, although greatly reduced, that it will be dropped.

Just got back from class, so I'll read your post again and look for more possible errors...

FrYGuY
February 24th, 2003, 09:12 AM
Here's one...

In your post you said that a network of 2,000,000 users would be 100,000 hubs. The entire G2 protocol was written to make leafs as minimal in cost as possible making leaf density higher than in Gnutella v0.6, allowing for hubs election to be even more strict meaning that even better machines are the ones becoming hubs, making leaf density even higher. Hence, the plan is to have leaf densities of at LEAST 200 leaves/hub.
Your example is 1 hub in 20 users. Instead, it should be 1 hub per 300 users (Average between 200 and 400, 400 being the default setting for leaves allowed on a hub). And the hub count drops DRASTICALLY to 6,700. Up it to 400, and 5000 becomes the hub count. I've even seen hubs of over 400 (though they aren't common).

So instead of 1.1 Mb, it's 500Kb. out, 750Kb in. And keep in mind, taking an iterative walker approach, a query in a network this size would take a while, so that it become possible, even to a 56K user, if it's spread out over a minute, which is still a relatively short time.

[Edit: Running the numbers through My head again, I forgot the redundant connections: 2 hubs per leaf. Meaning Actual Leaf Density drops to 100-200. Still, quite different from the 10 leafs/hub in the original calulations]

Sephiroth
February 24th, 2003, 09:40 AM
Originally posted by John W. Lindh
Flow control is not possible in a network relying on UDP because the hubs can not keep count of the number of queries a single non-firewalled leaf is sending. If a bad client entered the network there would not be much that could keep him from querying many more hosts than necessary.


Which is why some are very skeptical/cautious in using UDP in P2P networks. Plus the security risks and other programs that can occur like some hosts could get DDOSed. Especially when the anti-p2p companies start spamming on MP like they do on gnutella. Which it will be interesting to see if MP can handle it.

Which might be the reason why the specs arent out probably because of some issues havent been resolved.

The query rate is much higher. I think queries make up around 70% i think it might be high of total traffic on gnutella. So 6 queries per hour is pretty low. If it was based on how many searches a user initiates then maybe depends on how many researches they do but downloads get queried on startup and each 45min-hour afterwards.

FrYGuY
February 24th, 2003, 09:45 AM
Originally posted by Sephiroth
Which is why some are very skeptical/cautious in using UDP in P2P networks. Plus the security risks and other programs that can occur like some hosts could get DDOSed. Especially when the anti-p2p companies start spamming on MP like they do on gnutella. Which it will be interesting to see if MP can handle it.

Which might be the reason why the specs arent out probably because of some issues havent been resolved.

Note that Gnutella 2 uses 'Query Keys', making a DDOS attack much harder (near impossible) to accomplish. It was even announced that Query Keys were employed in the Gnutella 2 specs before GUESS was discussing it's possible need.

Sephiroth
February 24th, 2003, 10:46 AM
Originally posted by FrYGuY
Note that Gnutella 2 uses 'Query Keys', making a DDOS attack much harder (near impossible) to accomplish. It was even announced that Query Keys were employed in the Gnutella 2 specs before GUESS was discussing it's possible need.

No it doesnt, it prevents MP from being used to launch DDOS attacks on other computers on the internet it doesnt protect from queries being spammed on the network and servents from being spoofed.

Maybe i should explain it a little more Im talking about because there is no flow control in UDP and the network spammers who will spam queries the same kind as everyone else does that the traffic that will create might be enough to knock off some "hubs" in a DDOS manner.

John W. Lindh
February 24th, 2003, 11:11 AM
Originally posted by FrYGuY In your post you said that a network of 2,000,000 users would be 100,000 hubs. The entire G2 protocol was written to make leafs as minimal in cost as possible making leaf density higher than in Gnutella v0.6, allowing for hubs election to be even more strict meaning that even better machines are the ones becoming hubs, making leaf density even higher. Hence, the plan is to have leaf densities of at LEAST 200 leaves/hub.
This is a valid argument, - however Ultrapeer/Hub-election rules were so far not part of the protocol. Since there are also gnutella clients allowing for much more than 60 leafs (that's the number I worked with) I figured, it would be the best approach to assume the same leaf/ultrapeer density for Gnutella and G2.
So, let's see if this changes anything if Shareaza indeed had a higher leaf-hub density than a possible Gnutella client. 100,000 users = 1,500 hubs. To search the same amount of hosts, each user who searched 400 hubs before would now just search 120.

I'm encountering the first problem, - the intra-hub QRP is now less effective. I have to adjust it some how. Since I made a (maybe pessimistic) assumption of only 50% dropped queries because of QRP, I would now have 10% dropped queries (basic stochastics). Okay, I'm going to leave it 50% here and adjust it for Gnutella later.

- 120 * 100,000 * 6 / 3600 query packets (100 bytes again) per second that have to be forwarded to 3 hubs and 3 leafs = 18,000k/s. (10 leafs because I assume that over all the same amount of leafs have to be queried as before).
- Each search returns 60k query hits, 30k have to be routed by hubs = 5,000k/s
- Informing the about the 120 * 6 hubs (10 bytes per address sent) he searched indirectly plus where to search next (= probably another 120*6 addresses, possibly lower but it does not really matter anyway): 120 * 6 * 2 * 10bytes * 100,000 users * 6 searches per hour / 3600 s= 2,400 k/s
Total: 25,400k/s
per hub: 17k/s (only outgoing bandwidth).

Conclusion: In that case you should consider reducing your requery rate ....

Now Gnutella with a lower leaf/ultrapeer density (60/up, 3 ultrapeer connections per leaf with adjusted efficiency of intra-ultrapeer QRP, 80% instead of 50%).
Most remains as before, re-running my little calculator tool I'm now not sure anymore if I wasn't a little too pessimistic about Gnutella before:
3047 queries have to be sent to ultrapeers, 2790 of them benefit from QRP (because Hops = TTL) => only 558 of those last 2790 will really be sent (= 815 queries in total). About 612 ultrapeers will really be queried because of duplicates. I'll query the same amount of leafs as before (=360 I think I made a mistake here before, I actually queried a lot more gnutella leafs as if G2 would not use more or less the same type of QRP). So all in all 972 queries have to be sent.
100,000 users * 972 queries * 100 bytes * 6 (queries per hour) / 3600 s = 16,200k/s

How come this value is lower than G2's? Simple: The higher the outdegree, the more ultrapeers can benefit from intra-ultrapeer QRP. If G2 used the same outdegree it would certainly look a little better (maybe having the same bandwidth requirement as Gnutella).

I did not make a mistake when calculating the bandwidth needed for routing replies. It's still 6 times as high as G2's (because ALL queryhits have to be routed by ultrapeers and the average reply path is 3 hops instead of one).
Query Hits: ~30,000k/s

Total: 46,200k/s
per Ultrapeer: 9.24k/s (remember, we have more ultrapeers - a highler load does not matter as much if it is distributed among more ultrapeers).

Gnutella with 200 leafs per ultrapeer would need less bandwidth for queries (about half of it, working almost twice as well as G2 because of the high outdegree) but the traffic for routing hits remains the same, causing some 25-26k/s of bandwidth requirement per ultrapeer. Of course, the selective querying algorithms (adjusting the TTL dynamically and currently being developed by LW & BS) could return some savings as far as the averag reply path is concerned, possibly reducing 20k/s.

Note that not all queries return as many results and if there are many more urn queries than fuzzy keyword searches, those numbers would be crap anyway.

----
In a large network with almost no queryhits that have to be sent, the higher leaf density would not make much of a difference. Assuming a 6,000 hub G2 (200 leafs/hub) network equals a 20,000 ultrapeer Gnutella network (60 leafs / ultrapeer) we will get the following picture when trying to reach half of the network and receiving a negligible amount of queryhits (maybe 3 or 4, this is an urn query):

G2: 64.6kb outgoing bandwidth for the querier and I would assume some 80kb for the rest of the network (sending him the addresses of 3000 queried hubs, plus where to query next, plus quite a numbr of duplicates, 8,000 addresses sent in total 10 bytes per address not even starting to account for query keys). 13,3 bytes per hub.

Gnutella: 100 bytes for the querier, 1,100 queries sent in total = 110kb for the network distributed among 3.3 times as many ultrapeers. Per ultrapeer: 5.5bytes.


Your example is 1 hub in 20 users. Instead, it should be 1 hub per 300 users (Average between 200 and 400, 400 being the default setting for leaves allowed on a hub). And the hub count drops DRASTICALLY to 6,700.

Up it to 400, and 5000 becomes the hub count. I've even seen hubs of over 400 (though they aren't common).

So instead of 1.1 Mb, it's 500Kb. out, 750Kb in. And keep in mind, taking an iterative walker approach, a query in a network this size would take a while, so that it become possible, even to a 56K user, if it's spread out over a minute, which is still a relatively short time.
A couple of remarks:
6,700 hubs instead of 100,000 decreases the outgoing bandwidth requirement for the querier to 73kb, not 750kb. The problem is: If only 10% of the addresses have to be sent by 10% of the hubs, the traffic per hub remains the same. - And as I demonstrated Gnutella could still handle almost twice as many queries like that compared to G2. Even if you adjusted the outdegree, which would improve the situation for the querier, it would not change a thing for the hubs which would still have to return more or less the same number of addresses.

If Gnutella adjusted its leaf density this would not affect the bandwidth requirement per ultrapeer substantially, simply because - thanks to TCP - it doesn't have to bother telling the querying client which ultrapeers he has searched and where he should search next.

For urn searches or for extremely rare search keywords, using TCP connections like Gnutella can prove more efficient than the use of UDP like G2.

John W. Lindh
February 24th, 2003, 11:27 AM
Originally posted by FrYGuY
[Edit: Running the numbers through My head again, I forgot the redundant connections: 2 hubs per leaf. Meaning Actual Leaf Density drops to 100-200. Still, quite different from the 10 leafs/hub in the original calulations]
Correction: It was 20 leafs per hub, - assuming 3 hubs per leaf. But I demonstrated that the advantage of having a high leaf-density is small. It simply means that fewer hubs have to do the same amount of work.

FrYGuY
February 24th, 2003, 12:37 PM
Just got back from another class. My brain is dead for the day. I'll look at it tomorrow, or maybe later tonight. and get back to you.

Sephiroth
February 24th, 2003, 12:49 PM
Originally posted by FrYGuY
Just got back from another class. My brain is dead for the day. I'll look at it tomorrow, or maybe later tonight. and get back to you.

Thats why i skip... ;)

FrYGuY
February 25th, 2003, 01:17 PM
Here's something which affects the calculations, though it's not exactly a mistake.

Shareaza uses either 2^19 or 2^20 QRP tables and have an infinity value of 1 so each entry requires 1 bit, meaning the number of non-matching queries which will hit the leaf are very minimal.

Bad side is that such large tables are memory intensive. 2^16, or 65536 bits per leaf (8192 bytes/leaf), and at an average of 400 leaf, thats 26,214,400 bits/leaf and 3,276,800 bytes/leaf of RAM. Fortunantly, with a reduced need for hubs, setting a higher RAM requirement for a hub is feasible.

While you are correct in assuming that High Outdegree allows theoretically for more beneficial QRP tables, most Gnutella v0.6 clients are only using tables 2^16 in size (65535), thus, ironically benefitting LESS from the QRP tables.

John W. Lindh
February 25th, 2003, 03:59 PM
Most Gnutella servents should support QRP v0.2 by now, which uses 1-bit entries (infty 1). To save space Shareaza could use dynamically sized QRTs like e.g. gtk-gnutella already does and LimeWire is going to implement.

The number of false positives depends on how many files the users are sharing. If users share only few hundred files, the number of false positivies is neglegible.

FrYGuY
February 25th, 2003, 05:00 PM
Originally posted by John W. Lindh
Most Gnutella servents should support QRP v0.2 by now, which uses 1-bit entries (infty 1). To save space Shareaza could use dynamically sized QRTs like e.g. gtk-gnutella already does and LimeWire is going to implement.

I believe this is already the case, with either 2^19 or 2^20 QRP tables. GTK-g uses a table size default at 2^19 to a maximum of 2^21 now, I believe.

With 1558 files shared currently (On both Gnutella 2 and Gnutella v0.6), I'm averaging a little under a 1KBps in queries, so I'd say I'm fairly well shielded from query hits.

dock0184
August 2nd, 2004, 11:39 AM
So what G2 HUB really needs is to create a database of it's leaves' shared files so it doesn't have to send query to each leaf.

thongsai
August 5th, 2004, 10:12 AM
well i think g1 is as good as g2. but i will never get to find out cuz the two main players want us to buy their p2p app to get global searching..