We have moved into the second half of our long-running series on what file-sharing studies really say. After everything we’ve seen so far, which includes some rather eye-opening perspectives, we decided to take a brief look at what could be the least disputed topic in file-sharing: that it is growing in popularity.
The study of the day goes by the name “A View of the Data on P2P File-sharing Systems”. It was published in 2008 in the Journal of the American Society for Information Science and Technology.
The study lays out its focus and method:
Our goal is to characterize the queries being issued and the data being shared in a representative file-sharing system, in this case, the Gnutella system
We collected two types of data: query data, i.e., queries issued by users, and shared data, i.e., descriptions of file shared by users. Our goal is to collect a representative set of data to yield a true picture of how P2P file-sharing systems are being used.
For the search queries, the results were interesting. First, the study states:
There are more than 24 million queries in the September 2006 query log after preprocessing. The log was analyzed for type, length and term distribution. This information is useful in determining the desires and behavior of users.
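To give a rough idea of what analyzing a query log for length and term distribution can look like, here is a minimal sketch. It is not the study's code: the function name `summarize_queries`, the whitespace tokenization and the sample queries are all my own assumptions for illustration.

```python
from collections import Counter


def summarize_queries(queries):
    """Return (length distribution, term frequencies) for query strings.

    Hypothetical helper: tokenizes on whitespace and lowercases terms,
    which may differ from the preprocessing the study actually used.
    """
    length_counts = Counter()
    term_counts = Counter()
    for query in queries:
        terms = query.lower().split()
        if not terms:
            continue  # skip empty or whitespace-only queries
        length_counts[len(terms)] += 1  # query length measured in terms
        term_counts.update(terms)       # how often each term appears
    return length_counts, term_counts


# Made-up sample log, purely for illustration.
sample_queries = ["madonna mp3", "linux iso", "Madonna", "   "]
lengths, terms = summarize_queries(sample_queries)
print(dict(lengths))
print(terms.most_common(1))
```

On a real log of millions of queries, the same two counters would show how long typical queries are and which terms dominate.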
The study went on to track what was used in searches for 2006, 2007, and 2008 and this graphic pretty much sums up the results:
For the pessimist within you, one could argue that, over the years, users have become lazy in searching for content. People might just type in any old keyword, without specifying anything in particular, and search on that basis alone. I would say that users may start with a general search term and, if the results are too scattered, move on to something more specific. Really, though, there are a lot of ways to read these results. The study also looked at numerous other aspects of the queries, including query length, hourly query volume and a whole lot more.
The study then looked at the contents of what was being shared. The study says:
Our shared data log comprises file information collected from 30,000 peers in September 2006. We use the content-based hash keys of files to uniquely identify instances (replicas) of the same file. We classified files into types by their filename extensions (e.g., “exe” extensions signify applications and “pdf” extensions classify documents). Files with unknown extensions are classified as “unknown.”
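The two ideas in that passage, classifying by extension and spotting replicas by content hash, can be sketched roughly as follows. This is only an illustration under my own assumptions: the study confirms the “exe” and “pdf” mappings, but the other table entries, the helper names and the choice of SHA-1 as the hash are mine.

```python
import hashlib
from collections import defaultdict
from pathlib import PurePosixPath

# Only "exe" and "pdf" come from the study's description;
# the remaining entries are assumed for the sake of the example.
EXTENSION_TYPES = {
    "exe": "application",
    "pdf": "document",
    "mp3": "audio",
    "jpg": "image",
    "avi": "video",
}


def classify(filename):
    """Map a filename to a type by its extension, else "unknown"."""
    ext = PurePosixPath(filename).suffix.lstrip(".").lower()
    return EXTENSION_TYPES.get(ext, "unknown")


def group_replicas(shared_files):
    """Group (filename, content_bytes) pairs by a hash of the content.

    Files with identical bytes land under the same key even when
    they have been renamed.
    """
    replicas = defaultdict(list)
    for name, content in shared_files:
        key = hashlib.sha1(content).hexdigest()  # assumed hash function
        replicas[key].append(name)
    return replicas


# Made-up data: the same bytes shared under two different names.
shared = [
    ("song.mp3", b"same bytes"),
    ("track01.MP3", b"same bytes"),
    ("notes.xyz", b"other bytes"),
]
groups = group_replicas(shared)
print([classify(name) for name, _ in shared])
print(sorted(len(names) for names in groups.values()))
```

Grouping by content rather than by name is what lets a study like this count renamed copies as replicas of one file.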
The findings were that the vast majority of the files shared were audio (24 million), with “unknown” files coming in at a distant second (3.95 million) and images rounding out the top three (2.56 million). For those wondering, video was the least popular kind of file (1.12 million). The study went on to discuss replicated files, renamed files and similar topics.
The study then drew the following conclusions:
Our study revealed important characteristics of queries and shared data in Gnutella, one of today’s largest P2P file-sharing systems. It has been shown that the data in the Gnutella system is extensive and distinct in character from that of the Web. Also, the use of P2P file sharing is increasing. Compared with the query rate of September 2006, the query rate in April 2007 is 80% greater and the query rate in September 2008 is 135% greater. This indicates the significance of the P2P file-sharing application, and thus the importance of effective search techniques for them.
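Both percentages in that quote are measured against September 2006, so it is worth working out what they imply about the growth between April 2007 and September 2008. The small calculation below is my own arithmetic on the quoted figures, not something the study reports; the function and variable names are made up.

```python
def relative_rate(percent_greater):
    """Convert "X% greater than the baseline" into a multiple of the baseline."""
    return 1 + percent_greater / 100


# September 2006 is the baseline (1.0) for both quoted figures.
april_2007 = relative_rate(80)   # 1.8 times the baseline
sept_2008 = relative_rate(135)   # 2.35 times the baseline

# Growth from April 2007 to September 2008, derived from the two multiples.
later_growth = (sept_2008 / april_2007 - 1) * 100
print(f"{later_growth:.1f}%")  # prints "30.6%"
```

In other words, if the quoted figures hold, the query rate was still climbing after April 2007, but more slowly than in the first seven months.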
I think there is very little in the way of surprises in the finding that file-sharing has increased in popularity. Still, that doesn’t make this study boring, since it sheds some interesting light on what specifically is being shared and how people are finding the content of their choice. The sad element in all of this, I think, is that there is really no way to replicate these circumstances and gather reasonably accurate statistics like this in the future, as many users have moved to private file-sharing networks. What is being shared publicly now wouldn’t necessarily reflect what is and isn’t popular, and I don’t see how anyone could offer reasonably accurate statistics on the current state of file-sharing.