linux-sound.org's data as csv (was: Re: [linux-audio-dev] Linux soundapps pages updated)
Jan Weil
Jan.Weil at web.de
Mon Apr 19 17:51:12 UTC 2004
On Mon, 2004-04-12 at 18:29, Paul Winkler wrote:
> FYI, I'm still planning to implement my own proposal which has
> been discussed quite a lot in the L-A-U archives.
> I do somewhat similar sites for a living. It just needs me to block out a
> chunk of time (1 or 2 weekends) to bang it out.
Hi Paul,
I'd like to assist, so I've written a little script which automatically
extracts all the links from linux-sound.org.
It depends on Ruby, wget, lynx and sed.
The output is tab-separated csv (tsv) with three fields per row:
text, urls and category.
Some of the <li>s contain more than one url; for these, the urls are
separated by blanks (' ').
The category is either the title (<h3>) of the subpage or the text of
the list item which contains the current list (<ul>).
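To make the row format concrete, here is a small sketch showing how one
output line splits back into its three fields (the row below is
hypothetical example data, not actual linux-sound.org output):

```ruby
# Hypothetical example row; real rows are produced by the script below.
row = "Ardour\thttp://ardour.org/ http://www.ardour.org/\tHard Disk Recording"

text, urls, cat = row.split("\t")
url_list = urls.split(" ")  # multiple urls per item are blank-separated

# text     => "Ardour"
# url_list => ["http://ardour.org/", "http://www.ardour.org/"]
# cat      => "Hard Disk Recording"
```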
The script expects at least one linux-sound.org url (one of the
subpages) as an argument, so you'll also need a working internet
connection.
Passing '-H' prints an additional csv header.
I also attached a bash script to extract all the subpages from
linux-sound.org.
If you have any problems with this script, I can send you all the data
off list.
HTH,
Jan
P.S. Follow-up to LAU?
-------------- next part --------------
#!/usr/bin/env ruby
# This little piece of software is free in every sense of the word.
# Mon, 19 Apr 2004, Jan Weil <jan.weil at web.de>

if ARGV.include?("-h") || ARGV.include?("--help") || ARGV.size == 0
  puts "usage: #{File.basename($0)} [-H] URL..."
  puts "-H --header\tadd csv header"
  exit
end

if ARGV.include?("-H") || ARGV.include?("--header")
  $print_header = true
  ARGV.delete("-H")
  ARGV.delete("--header")
end

# Resolve all numbered link markers in str against $reference and return
# the urls joined by blanks, or false if str contains no markers.
def extract_urls(str)
  urls = []
  url_regex = /\[(\d+)\](\S.+)/
  loop do
    if str =~ url_regex
      urls.push($reference[$1.to_i])
      str.sub!(url_regex) {|s| $2}
    else
      break
    end
  end
  if not urls.empty?
    return urls.join(" ")
  else
    return false
  end
end

# Join continuation lines of a list item, then record its text, urls
# and category.
def push_li(line, level, regex)
  next_line = ""
  loop do
    next_line = $lines.pop
    if next_line =~ regex
      line += " #{$1}"
    else
      break
    end
  end
  $lines.push(next_line)
  urls = extract_urls(line)
  $data.push({"text" => line, "urls" => urls, "cat" => $cat[level] || "None"}) if urls
  $cat[level + 1] = line
end

ARGV.each do |url|
  $reference = []
  $cat = []
  $data = []
  # XXX this works, at least for linux-sound.org
  url =~ /(\w+\.\w+)$/
  loc = $1 or raise("Help me at XXX!")
  `wget #{url}`
  if $? != 0
    exit 1
  end
  tmp = loc + ".dump"
  # unset locales (we need ^References$)
  ENV["LANG"] = "C"
  `lynx -dump #{loc} > #{tmp}`
  if $? != 0
    STDERR << "calling lynx failed! Is it installed?\n"
    exit 1
  end
  # extract link list (legend)
  out = `sed -n '/^References$/,$p' #{tmp} | sed -n '3,$p'`.split(/$/)
  if $? != 0
    STDERR << "calling sed failed! Is it installed?\n"
    exit 1
  end
  out.each do |line|
    ary = line.split
    $reference[ary[0].to_i] = ary[1]
  end
  # extract data
  $lines = `sed -n '1,/^References$/p' #{tmp}`.split(/$/)
  File.delete(tmp)
  # we need a stack
  $lines.reverse!
  # traverse all lines
  loop do
    line = $lines.pop
    break if not line
    # title
    if line =~ /^ (\S.*)$/
      $cat[1] = $1
      next
    end
    # li level 1
    if line =~ / \* (\S.*)$/
      line = $1
      push_li(line, 1, /^ (\S.*)$/)
      next
    end
    # li level 2
    if line =~ / \+ (\S.*)$/
      line = $1
      push_li(line, 2, /^ (\S.*)$/)
      next
    end
    # li level 3
    if line =~ /^ o (\S.*)$/
      line = $1
      push_li(line, 3, /^ (\S.*)$/)
      next
    end
    # li level 4
    if line =~ /^ # (\S.*)$/
      line = $1
      push_li(line, 4, /^ (\S.*)$/)
      next
    end
    # there is no higher level, right?
  end
  $data.sort! do |a, b|
    if a["cat"] == b["cat"]
      a["text"] <=> b["text"]
    else
      a["cat"] <=> b["cat"]
    end
  end
  print "Text\tUrls\tCategory\n" if $print_header
  $data.each do |hash|
    print "#{hash['text']}\t#{hash['urls']}\t#{hash['cat']}\n"
  end
end
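The core trick in extract_urls above is resolving lynx's numbered link
markers ([1], [2], ...) against the References legend at the end of the
dump. A stripped-down sketch of that substitution, using made-up sample
data in place of a real lynx dump:

```ruby
# Made-up legend and line, mimicking the shape of lynx -dump output.
reference = []
reference[1] = "http://ardour.org/"  # lynx numbers links from 1

line = "[1]Ardour a multichannel hard disk recorder"
urls = []
url_regex = /\[(\d+)\](\S.+)/
while line =~ url_regex
  urls << reference[$1.to_i]        # look up the numbered reference
  line = line.sub(url_regex) { $2 } # strip the [n] marker from the text
end

# urls => ["http://ardour.org/"]
# line => "Ardour a multichannel hard disk recorder"
```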
-------------- next part --------------
A non-text attachment was scrubbed...
Name: extract-all
Type: text/x-sh
Size: 479 bytes
Desc: not available
URL: <http://lists.linuxaudio.org/pipermail/linux-audio-dev/attachments/20040419/6b35d405/attachment.sh>
More information about the Linux-audio-dev mailing list