linux-sound.org's data as csv (was: Re: [linux-audio-dev] Linux soundapps pages updated)

Jan Weil Jan.Weil at web.de
Mon Apr 19 17:51:12 UTC 2004


On Mon, 2004-04-12 at 18:29, Paul Winkler wrote:
> FYI, I'm still planning to implement my own proposal which has
> been discussed quite a lot in the L-A-U archives.
> I do somewhat similar sites for a living.  It just needs me to block out a 
> chunk of time (1 or 2 weekends) to bang it out.

Hi Paul,

I'd like to assist so I've written a little script to automatically
extract all the links from linux-sound.org.
It depends on Ruby, wget, lynx and sed.

The output is tab-separated values (TSV) containing three fields per
row: text, urls and category.
Some of the <li>s contain more than one URL. For these the URLs are
separated by blanks (' ').
The category is either the title (<h3>) of the subpage or the text of
the list item which contains the current list (<ul>).
The script expects at least one linux-sound.org URL as an argument
(one of the subpages), so you'll also need a working internet
connection.
Passing '-H' prints an additional header row.
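In case it helps downstream: each row can be split back into its
fields with plain string operations. A minimal sketch, using a made-up
row (the app name, URLs and category here are not real data):

```ruby
# Hypothetical TSV row in the script's output format:
# text, blank-separated URLs, and category, joined by tabs.
row = "SomeApp\thttp://example.org/a http://example.org/b\tSoftware"

text, urls, category = row.split("\t")
url_list = urls.split(" ")  # multiple URLs are blank-separated

puts text            # => SomeApp
puts url_list.size   # => 2
puts category        # => Software
```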
I also attached a bash script to extract all the subpages from
linux-sound.org.
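For the curious: the script leans on the layout of a lynx dump, which
puts the page body first and a "References" legend (mapping the [n]
markers to URLs) at the end. Here is a standalone demonstration of the
two sed calls the script uses, on a tiny made-up dump file:

```shell
# A fake lynx dump (made-up content) with the layout the script expects:
# page body first, then a "References" legend for the [n] markers.
cat > dump.txt <<'EOF'
  Some Page

     * [1]Foo

References

   1. http://example.org/foo
EOF

# What the script feeds its line scanner: everything up to "References".
sed -n '1,/^References$/p' dump.txt

# What the script parses into $reference: the legend after "References",
# skipping the heading and the blank line.
sed -n '/^References$/,$p' dump.txt | sed -n '3,$p'

rm dump.txt
```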

If you have any problems with this script I can send you all the data
off list.

HTH,

Jan

P.S. Follow-up to LAU?
-------------- next part --------------
#!/usr/bin/env ruby

# This little piece of software is free in every sense of the word.
#  Mon, 19 Apr 2004, Jan Weil <jan.weil at web.de>


if ARGV.include?("-h") || ARGV.include?("--help") || ARGV.size == 0
	puts "usage: #{File.basename($0)} [-H] URL..."
	puts "-H --header\tadd csv header"
	exit
end

if ARGV.include?("-H") || ARGV.include?("--header")
	$print_header = true
	ARGV.delete("-H")
	ARGV.delete("--header")
end

def extract_urls(str)
	urls = []
	url_regex = /\[(\d+)\](\S.+)/
	while str =~ url_regex
		urls.push($reference[$1.to_i])
		str.sub!(url_regex) { $2 }
	end
	urls.empty? ? false : urls.join(" ")
end

def push_li(line, level, regex)
	next_line = ""
	loop do
		next_line = $lines.pop
		if next_line =~ regex
			line += " #{$1}"
		else
			break
		end
	end
	$lines.push(next_line)
	urls = extract_urls(line)
	$data.push({"text" => line, "urls" => urls, "cat" => $cat[level] || "None"}) if urls
	$cat[level+1] = line
end

ARGV.each do |url|
	$reference = []
	$cat = []
	$data = []

	# XXX this works, at least for linux-sound.org
	url =~ /(\w+\.\w+)$/
	loc = $1 or raise("Help me at XXX!")

	`wget "#{url}"`
	unless $?.success?
		STDERR << "calling wget failed! Is it installed?\n"
		exit 1
	end

	tmp = loc + ".dump"

	# unset locales (we need ^References$)
	ENV["LANG"] = "C"

	`lynx -dump #{loc} > #{tmp}`
	unless $?.success?
		STDERR << "calling lynx failed! Is it installed?\n"
		exit 1
	end

	# extract link list (legend)
	out = `sed -n '/^References$/,$p' #{tmp} | sed -n '3,$p'`.split(/$/)
	unless $?.success?
		STDERR << "calling sed failed! Is it installed?\n"
		exit 1
	end

	out.each do |line| 
		ary = line.split
		$reference[ary[0].to_i] = ary[1]
	end

	# extract data
	$lines = `sed -n '1,/^References$/p' #{tmp}`.split(/$/)

	File.delete(tmp)

	# we need a stack
	$lines.reverse!

	# traverse all lines
	loop do
		line = $lines.pop
		break if not line
		
		# title
		if line =~ /^  (\S.*)$/
			$cat[1] = $1
			next
		end
		
		# li level 1
		if line =~ /^     \* (\S.*)$/
			line = $1
			push_li(line, 1, /^       (\S.*)$/)
			next
		end
		
		# li level 2
		if line =~ /          \+ (\S.*)$/
			line = $1
			push_li(line, 2, /^            (\S.*)$/)
			next
		end
		
		# li level 3
		if line =~ /^               o (\S.*)$/
			line = $1
			push_li(line, 3, /^                 (\S.*)$/)
			next
		end
		
		# li level 4
		if line =~ /^                    # (\S.*)$/
			line = $1
			push_li(line, 4, /^                      (\S.*)$/)
			next
		end
		
		# there is no higher level, right?
	end

	# sort by category first, then by text
	$data.sort! do |a, b|
		[a["cat"], a["text"]] <=> [b["cat"], b["text"]]
	end

	print "Text\tUrls\tCategory\n" if $print_header
	$data.each do |hash|
		print "#{hash['text']}\t#{hash['urls']}\t#{hash['cat']}\n"
	end
end
-------------- next part --------------
A non-text attachment was scrubbed...
Name: extract-all
Type: text/x-sh
Size: 479 bytes
Desc: not available
URL: <http://lists.linuxaudio.org/pipermail/linux-audio-dev/attachments/20040419/6b35d405/attachment.sh>
