Module:Language/name/data/ISO 639 synonym extraction tool

require('strict');
local p = {};

--[=[------------------------< I S O _ S Y N O N Y M _ E X T R A C T >-----------------------------------------

{{#invoke:Language/name/data/ISO 639 synonym extraction tool|ISO_synonym_extract|file-date=2013-01-11}}



reads a local copy of data from the table at http://www.loc.gov/standards/iso639-2/php/English_list.php, extracts
the ISO 639-2 (or 639-2T) codes that have equivalent ISO 639-1 codes and creates a table to translate 639-2 to 639-1.
ISO 639-3 uses the 639-2T codes

useful lines in the source table have the form:
	<English name>\t<all English names>\t<all French names>\t<639-2 code>\t<639-1 code>\n
where:
	<English name> is the primary English name (not used here); each of <all English names> gets its own line, so the same code can be listed more than once
	<all English names> is all of the English names (not used here)
	<all French names> is all of the French names (not used here)
	<639-2 code> is the three-character ISO 639-2 or 639-2B/639-2T language code; when 639-2T present, use that code
	<639-1 code> is the two-character ISO 639-1 language code synonym of the -2 code (if one is defined)
		
	like this (with synonym):
		Abkhazian	Abkhazian	abkhaze	abk	ab
	or (without synonym):
		Achinese	Achinese	aceh	ace	 

for the file date use the date listed at the bottom of the source page in yyyymmdd numeric format without hyphens or spaces

]=]

function p.ISO_synonym_extract (frame)
	local page = mw.title.getCurrentTitle();									-- get a page object for this page
	local content = page:getContent();											-- get unparsed content
	local content_table = {};													-- table of text lines from source
	local split_table = {};														-- table of lines split at the tabs	
	local skip_table = {};														-- table of 639-2/639-2T codes that have been handled; used to prevent duplication
	local out_table = {};														-- output table
	
	local file_date = 'File-Date: ' .. (frame.args["file-date"] or '');			-- set the file date line from |file-date= (from the bottom of the source page); empty when the parameter is omitted

	content_table = mw.text.split (content, '[\r\n]');							-- make a table of text lines
	for _, line in ipairs (content_table) do									-- for each line
		split_table = mw.text.split (line, '\t');								-- split at the tabs
		if split_table[5] and split_table[5]:match ('%S') then					-- if there is a 639-1 code (field is neither empty nor only whitespace)
			local code = split_table[4]:match ('%a+/(%a+)') or split_table[4];	-- when 639-2B/639-2T use 639-2T else use 639-2
			if not skip_table[code] then										-- skip if code already in the skip table because more than one language name
				skip_table[code] = true;										-- remember that we've handled this 639-2/639-2T code
				table.insert (out_table, "[\"" .. code .. "\"] = \"" .. split_table[5] .. "\"");		-- make new table entry
			end
		end
	end
	
	table.sort (out_table);
	
	return "<br /><pre>-- " .. file_date .. "<br />return {<br />&#9;" .. table.concat (out_table, ',<br />&#9;') .. "<br />&#9;}<br />" .. "</pre>";
end

return p;
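
The extraction loop above can be tried outside the wiki with a minimal sketch in plain Lua. It assumes nothing from the Scribunto `mw` library; `split_on_tabs` and `extract_synonyms` are hypothetical names introduced only for this illustration, and the sample rows are hand-typed in the five-column layout described in the header comment rather than copied from the source page. The blank-field test uses `%S`, which is slightly more forgiving than a comparison against a single space.

```lua
-- Standalone sketch for illustration only; not part of the module.
-- split_on_tabs() and extract_synonyms() are hypothetical helper names.

-- split a line into its tab-separated fields, keeping empty fields
local function split_on_tabs (line)
	local fields = {};
	for field in (line .. '\t'):gmatch ('([^\t]*)\t') do
		table.insert (fields, field);
	end
	return fields;
end

-- build sorted '["639-2T"] = "639-1"' entries from tab-separated source text
local function extract_synonyms (content)
	local seen = {};										-- codes already handled
	local out = {};											-- output entries
	for line in content:gmatch ('[^\r\n]+') do				-- for each non-empty line
		local fields = split_on_tabs (line);
		local code_1 = fields[5];
		if code_1 and code_1:match ('%S') then				-- row has a 639-1 code
			local code_2 = fields[4]:match ('%a+/(%a+)') or fields[4];	-- prefer 639-2T in 'B/T' pairs
			if not seen[code_2] then						-- one entry per code
				seen[code_2] = true;
				table.insert (out, '["' .. code_2 .. '"] = "' .. code_1 .. '"');
			end
		end
	end
	table.sort (out);
	return out;
end

-- sample rows (assumed layout): a synonym, no synonym, a B/T pair, a duplicated code
local sample = table.concat ({
	'Abkhazian\tAbkhazian\tabkhaze\tabk\tab',
	'Achinese\tAchinese\taceh\tace\t ',
	'Albanian\tAlbanian\talbanais\talb/sqi\tsq',
	'Castilian\tSpanish; Castilian\tespagnol; castillan\tspa\tes',
	'Spanish\tSpanish; Castilian\tespagnol; castillan\tspa\tes',
	}, '\n');

local result = extract_synonyms (sample);
print ('return {\n\t' .. table.concat (result, ',\n\t') .. '\n\t}');
```

With these five rows the sketch yields three entries: `abk` maps to `ab`, `spa` to `es` (listed once although two rows carry it), and `sqi` to `sq` (the 639-2T half of `alb/sqi`); the `ace` row is dropped because it has no 639-1 code.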