i want to download all the rap loops from flashkit so i can do some nerdcore rap -- but the interface to download the loops from flashkit sucks badly.
now i could be a n00b and sit there for 1.5h and click madly like an idiot until i had downloaded all the files but im cool so i dont want to do that. i mean doing that would give me a headache on top of everything else, yknow what i mean?
so instead i analyze the site and find out that there is a pattern -- of course there is -- these pages are machine generated and that means there is a logical pattern. if you can figure out this pattern, deduce it like sherlock holmes, then you can rip it.
now im not going to bother to hide my tracks because (1) this is not malevolent (2) fuck flashkit if they try and prosecute me (3) i cant be bothered
the pattern i notice is that all the links to the loops are in a format like this: http://www.flashkit.com/loops/Rap/more46.php where 46 is the offset in the pagination of results. it goes from more2.php to more47.php and doesnt use 04 for 4 -- it just uses regular human numbers.
then to download it you have to hit the download link for that result which takes you to a download page -- and from there you can actually do the download. what a pain in the ass.
here are pictures to show you what i mean if you cant be bothered to visit the site (i dont blame you as the flashkit site is goddamn slow)
so you can see that the page of results links to pages that contain the ID in them, and notice that the download link to the wav contains the ID in it as well.
now this is where some knowledge and guesses come in (some would say a lil experience). i figure that maybe, just maybe, theyre using mod_rewrite, because i happened to notice a link with square brackets in it and also links with forward slashes in them. so i try the wav link http://www.flashkit.com/downloads/loops/wav/ID/ and lo and behold it works -- it does a 302 to the actual file location.
$ curl -I http://www.flashkit.com/downloads/loops/wav/4987/
HTTP/1.1 302 Found
Date: Wed, 27 Jul 2005 00:18:47 GMT
Server: Apache
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Set-Cookie: s=8c5e4ae86e8ea42d6966f194cf3da51c; path=/
Pragma: no-cache
Location: http://downloads.flashkit.com/loops/Rap/Whats_Cr-PlayaJay-4987/Whats_Cr-PlayaJay-wav-4987.zip
Content-Type: text/html
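if you want to pull the real file location out of those headers from a script, here is a rough sketch in php. redirect_target is a made-up helper name (not something curl or flashkit gives you), and the sample text is just a trimmed copy of the headers from the curl -I run.

```php
<?php
// Sketch only: find the Location header in raw "curl -I" output.
// redirect_target() is a hypothetical helper, not part of any library.
function redirect_target(string $rawHeaders): ?string {
    // /m lets ^ match at the start of every header line, /i ignores case
    if (preg_match('/^Location:\s*(\S+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null; // no redirect in these headers
}

// trimmed copy of the response headers for loop 4987
$sample = "HTTP/1.1 302 Found\r\n"
        . "Server: Apache\r\n"
        . "Location: http://downloads.flashkit.com/loops/Rap/Whats_Cr-PlayaJay-4987/Whats_Cr-PlayaJay-wav-4987.zip\r\n"
        . "Content-Type: text/html\r\n";

echo redirect_target($sample) . "\n";
```

in practice you dont need this for the rip itself, since curl -L follows the redirect for you. its only handy if you want to log where each ID ends up.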
so now we are home free. we just grab every page from more2.php to more47.php with a special case for the first page as it is named index.php and then we just parse out the IDs for the loops on that page, and then for each loop we download the wav zip.
once we got all the wav zips we can then just unzip them all and delete the zips -- and then copy the wav's to our windows computer and start making music.
im going to use php today on the unix command line.
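<?php
// fetch result pages 2..47 and save each one as aa-N.txt
for ($i = 2; $i <= 47; $i++) {
    $url = "http://www.flashkit.com/loops/Rap/more$i.php";
    echo $url . "\n";
    // escapeshellarg keeps the shell from touching the url
    exec("curl " . escapeshellarg($url) . " > aa-$i.txt");
}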
so now we have aa-2.txt to aa-47.txt and you can download the index.php manually and save it as aa-1.txt so you then have aa-1.txt to aa-47.txt
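if you would rather not grab the first page by hand, a small helper can treat page 1 as the special case. this is only a sketch: rap_page_url is a name i made up, and i am assuming index.php sits in the same directory as the moreN.php pages.

```php
<?php
// Sketch only: build the url for result page $n of the Rap loops.
// Assumption: page 1 is index.php in the same directory as moreN.php.
function rap_page_url(int $n): string {
    $base = "http://www.flashkit.com/loops/Rap/";
    return $base . ($n === 1 ? "index.php" : "more$n.php");
}

// urls for pages 1..47, in order
$urls = array_map('rap_page_url', range(1, 47));

echo $urls[0] . "\n";   // first page
echo $urls[46] . "\n";  // last page
```

you could then loop over $urls and save page N as aa-N.txt, same as the loop for pages 2 to 47.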
so now for aa-*.txt we are going to parse out the IDs and stick em into a new file named out.txt (i am very original)
<?php
// collect the loop IDs from every saved results page
$IDz = array();
foreach (glob('aa-*.txt') as $a) {
    $c = explode("\n", file_get_contents($a));
    foreach ($c as $line) {
        // grab whatever sits between ID= and the closing quote of the link
        if (preg_match('/jump\.php\?type=loops&ID=(.+?)"/', $line, $match)) {
            $IDz[] = $match[1];
        }
    }
}
// drop duplicates, then write one ID per line
$IDz = array_unique($IDz);
file_put_contents('out.txt', implode("\n", $IDz));
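before running the parser over all 47 pages it is worth poking at the pattern on a tiny sample. this is a sketch under a few assumptions: extract_loop_ids is a made-up name, the sample markup is invented (not copied from flashkit), and i am guessing that IDs are always digits, like 4987.

```php
<?php
// Sketch only: pull loop IDs out of a chunk of html.
// Assumption: IDs are purely numeric, so (\d+) is enough.
function extract_loop_ids(string $html): array {
    // preg_match_all catches every link, even several on one line
    preg_match_all('/jump\.php\?type=loops&ID=(\d+)/', $html, $m);
    // dedupe and reindex so the result is a plain list
    return array_values(array_unique($m[1]));
}

// invented markup, for illustration only
$sample = '<a href="jump.php?type=loops&ID=4987">wav</a> '
        . '<a href="jump.php?type=loops&ID=4987">mp3</a>' . "\n"
        . '<a href="jump.php?type=loops&ID=5012">wav</a>';

print_r(extract_loop_ids($sample));
```

if the real pages escape the ampersand as &amp; this pattern would miss the links, so check one saved page by eye first.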
a huge benefit of doing this thing in chunks is that we download every page once, in one go, and then do our processing later on the saved copies -- because it will take a few tries to get the parsing working just right. if we fetched the pages again on every try we would be idiots, pissing off their site admin by using their cpu/memory/bw unnecessarily -- which might provoke banning of our IP or an abuse complaint. the trick is to be light and fast, like an ordos raider/saboteur.
so now we have out.txt filled with IDs of the files we want. i specifically want the wav's so using the pattern http://www.flashkit.com/downloads/loops/wav/ID/ does the trick.
now i make a new directory to contain the zipped wav downloads, enter that dir and execute the following
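<?php
// run this from inside the download dir; out.txt sits one level up
$IDz = explode("\n", file_get_contents('../out.txt'));
foreach ($IDz as $ID) {
    $ID = trim($ID);
    if ($ID === '') continue; // skip blank lines
    $url = "http://www.flashkit.com/downloads/loops/wav/$ID/";
    //echo $url . "\n";
    if (!file_exists("$ID.zip")) {
        // -L follows the 302 to the real file
        exec("curl -L " . escapeshellarg($url) . " > " . escapeshellarg("$ID.zip"));
        sleep(5); // be polite: pause between downloads
    }
}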
the -L to curl makes it follow the 302 redirect. i use an 'if' to check whether the zip already exists because i had to rerun the script: i stopped it after a bit to see what it was outputting. the check for an existing download is good because you dont repeat the downloads -- which would be very stupid. notice that i use sleep to pause for 5 seconds between downloads -- so that i do NOT overuse their machines and force them to block my IP -- use your head people.
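one catch with the file_exists check: if curl dies halfway or the server hands back an error page, a broken file is left on disk and a rerun will skip it. here is a rough sketch of a sanity check you could run in the download dir. looks_like_zip is a made-up helper, and it only tests the first two bytes (zip files start with "PK"), so treat it as a quick smell test, not proof the archive is intact.

```php
<?php
// Sketch only: cheap check that a file at least starts like a zip.
// looks_like_zip() is a hypothetical helper, not a php builtin.
function looks_like_zip(string $path): bool {
    if (!is_file($path) || filesize($path) < 4) {
        return false; // missing, or too small to be any zip
    }
    $fh = fopen($path, 'rb');
    $magic = fread($fh, 2);
    fclose($fh);
    return $magic === "PK"; // zip signatures begin with these two bytes
}

// report suspicious downloads in the current directory
foreach (glob('*.zip') as $zip) {
    if (!looks_like_zip($zip)) {
        echo "bad download: $zip\n";
    }
}
```

delete whatever it reports and rerun the download loop, and only the missing files get fetched again.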
thats it. now just unzip all your wav's with the following shell code:
$ for fn in *.zip; do unzip "$fn"; done
yes! now transfer the hip-hop loops, the wav's, to your windows computer and start making awesome nerdcore!
tagged as hip hop, hiphop, hip-hop, nerdcore, rap, ripping, scraping, unix, flashkit, slashdot, web molesting, loops, acid pro, acid, music, wav