using Collocational Information
Tokunaga, Takenobu
Department of Computer Science, Tokyo Institute of Technology
1 Introduction
Analyzing compound nouns is one of the crucial issues for natural language processing systems, in particular for systems that aim at wide coverage of domains. Registering all compound nouns in dictionaries is obviously an impractical approach, since we can create a new compound noun by combining nouns. Therefore, a mechanism to infer the structure of a compound noun from its individual nouns is necessary.
In this paper, we propose a method to analyze structures of Japanese compound nouns by using collocational information of words and a thesaurus. The collocational information is acquired from a corpus of four kanzi character words. For each possible structure of a compound noun, the preference is calculated based on this collocational information.
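2 Acquisition of collocational information
The procedure for acquiring collocational information consists of the following four steps:
1. to collect four kanzi character words.
2. to divide each of these words in the middle to obtain a pair of two kanzi character words. If either of them is not in the thesaurus, the four kanzi character word is discarded.
3. to assign thesaurus categories to each two kanzi character word.
4. to count occurrence frequencies of the category collocations.
3 Algorithm
The analysis consists of three steps:
1. to enumerate possible segmentations of an input compound noun by consulting headwords of the thesaurus.
2. to assign thesaurus categories to each word.
3. to calculate preferences of every structure of the compound noun according to the frequencies of category collocations.
We assume that the structure of a compound noun can be expressed by a binary tree. We also assume that the category of the right branch of a (sub)tree represents the category of the (sub)tree itself. This assumption holds because Japanese is a head-final language, namely, a modifier comes on the left of its modifiee. With these assumptions, the preference value of a structure is calculated by the following recursive function.
\[
p(t) =
\begin{cases}
1 & \text{if } t \text{ is leaf} \\
p(l(t)) \cdot p(r(t)) \cdot \mathit{cv}(\mathit{cat}(l(t)), \mathit{cat}(r(t))) & \text{otherwise}
\end{cases}
\]
where functions $l$ and $r$ return the left and right subtree of the tree $t$ respectively, and $\mathit{cat}$ returns the thesaurus categories of its argument. If the argument of $\mathit{cat}$ is a tree, $\mathit{cat}$ returns the category of the rightmost leaf of the tree. Function $\mathit{cv}$ returns an associativity measure of two categories, which is calculated from the frequency of category collocation described in the previous section. We use the following two measures for $\mathit{cv}$.
Probability:
\[
\mathit{cv}_1 = P(\mathit{cat}_1, \mathit{cat}_2)
\]
Mutual information statistics (MIS):
\[
\mathit{cv}_2 = \frac{P(\mathit{cat}_1, \mathit{cat}_2)}{P(\mathit{cat}_1, *) \cdot P(*, \mathit{cat}_2)}
\]
Here $P(\mathit{cat}_1, \mathit{cat}_2)$ is the relative frequency of the collocation in which $\mathit{cat}_1$ appears on the left of $\mathit{cat}_2$, and $*$ matches any category.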
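To illustrate how the acquisition and analysis steps fit together, the following Python sketch counts category collocations from four kanzi character words and then ranks the binary-tree structures of a compound noun. It is a minimal sketch under stated assumptions, not the authors' implementation. The thesaurus entries, category labels, toy corpus, and all function names are hypothetical. Each word is given a single category. The preference of a tree is taken to be 1 for a leaf and otherwise the product of the two subtree preferences and the associativity of their categories. The MIS variant is assumed to be the joint relative frequency divided by the product of the left-position and right-position marginals.

```python
from collections import Counter

# Toy thesaurus: headword -> category. Labels are made up, not taken from BGH.
THESAURUS = {"経済": "ECON", "政策": "PLAN", "研究": "STUDY"}


def acquire(four_kanzi_words):
    """Count category collocations from four kanzi character words."""
    counts = Counter()
    for word in four_kanzi_words:
        left, right = word[:2], word[2:]          # divide in the middle
        if left in THESAURUS and right in THESAURUS:
            counts[(THESAURUS[left], THESAURUS[right])] += 1
        # otherwise the word is discarded
    return counts


def segmentations(s):
    """Enumerate every split of s into thesaurus headwords."""
    if not s:
        return [[]]
    result = []
    for i in range(1, len(s) + 1):
        if s[:i] in THESAURUS:
            result += [[s[:i]] + rest for rest in segmentations(s[i:])]
    return result


def trees(words):
    """Enumerate binary trees over words; a leaf is a str, a node a pair."""
    if len(words) == 1:
        return [words[0]]
    return [(l, r)
            for i in range(1, len(words))
            for l in trees(words[:i])
            for r in trees(words[i:])]


def cat(t):
    """Category of a tree is the category of its rightmost leaf."""
    while isinstance(t, tuple):
        t = t[1]
    return THESAURUS[t]


def make_cv(counts, use_mis=False):
    """Build an associativity measure from collocation counts."""
    total = sum(counts.values()) or 1
    left, right = Counter(), Counter()
    for (a, b), n in counts.items():
        left[a] += n
        right[b] += n

    def cv(a, b):
        p_ab = counts[(a, b)] / total
        if not use_mis or p_ab == 0:
            return p_ab
        return p_ab / ((left[a] / total) * (right[b] / total))

    return cv


def preference(t, cv):
    """Recursive preference: 1 for a leaf, product form for a node."""
    if isinstance(t, str):
        return 1.0
    l, r = t
    return preference(l, cv) * preference(r, cv) * cv(cat(l), cat(r))


def analyze(compound, cv):
    """Return all candidate structures, highest preference first."""
    candidates = [t for seg in segmentations(compound) for t in trees(seg)]
    return sorted(candidates, key=lambda t: preference(t, cv), reverse=True)


# Toy corpus; the last word is discarded because "大国" is not in THESAURUS.
counts = acquire(["経済政策", "経済政策", "政策研究", "経済大国"])
cv1 = make_cv(counts)
ranked = analyze("経済政策研究", cv1)
print(round(preference(ranked[0], cv1), 3))  # 0.222
```

With this toy data the left-branching structure ((経済 政策) 研究) is ranked first, because the right-branching alternative needs a collocation of ECON with STUDY, which never occurs in the toy corpus.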
This PostScript version was created from the original author’s English article by the Japanese Information Sciences Project (JISP), at New York University, in collaboration with the RWCP, aiming at worldwide access to the information. Every precaution has been taken to avoid errors arising from the conversion of printed documents to electronic form, however, should there be any discrepancies, the JISP bears sole responsibility for them. Email address: www-admin@jisp.cs.nyu.edu, URL http://jisp.cs.nyu.edu/.
4 Experiments
We extract kanzi character sequences from editorials and columns of newspapers and texts of an encyclopedia.
710 compound nouns consisting of five kanzi characters and 789 compound nouns consisting of six kanzi characters are manually extracted from the set of the above kanzi character sequences. These two collections of compound nouns are used for test data.
As a thesaurus, we use Bunrui Goi Hyou (BGH). BGH is structured as a tree which has six levels of hierarchy. In this experiment, we use the categories at level 3.
Table 1 shows the results of the analysis for five and six kanzi character sequences. “” means that the correct answer is not obtained. The first row shows the percentage of cases in which the correct answer is uniquely identified, namely, with no tie. The rest of the rows, denoted “”, show the percentage of cases in which the correct answer is included among the answers ranked at the given place or higher.
Table 1: Accuracy of analysis [%].
6 kanzi
1 1 2 3 4
6
59 68 91 92 94
5
53 68 93 94 98
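The rank-based percentages reported in Table 1 can be illustrated with a short Python sketch. This is one plausible reading rather than the authors' evaluation code. The function names and data are hypothetical, and since the text does not state how ties are ranked, the sketch assumes that a structure's rank is one plus the number of candidates with a strictly higher preference.

```python
def rank_correct(scored, correct):
    """Return (rank, unique) for the correct structure, or None if absent.

    scored: list of (structure, preference) pairs for one compound noun.
    rank is 1 + the number of candidates with a strictly higher preference
    (an assumed tie convention); unique is True when no other candidate
    shares the correct structure's preference.
    """
    prefs = dict(scored)
    if correct not in prefs:
        return None                      # correct answer not obtained
    target = prefs[correct]
    rank = 1 + sum(1 for p in prefs.values() if p > target)
    unique = sum(1 for p in prefs.values() if p == target) == 1
    return rank, unique


def accuracy_rows(outcomes, max_rank=4):
    """Percentages: unique first place, then within rank 1..max_rank."""
    n = len(outcomes)
    found = [o for o in outcomes if o is not None]
    rows = [100.0 * sum(1 for rank, unique in found
                        if rank == 1 and unique) / n]
    for k in range(1, max_rank + 1):
        rows.append(100.0 * sum(1 for rank, _ in found if rank <= k) / n)
    return rows


# Four made-up test items: unique first, tied first, second, not obtained.
outcomes = [
    rank_correct([("A", 0.4), ("B", 0.1)], "A"),
    rank_correct([("A", 0.2), ("B", 0.2)], "A"),
    rank_correct([("A", 0.3), ("B", 0.1)], "B"),
    rank_correct([("A", 0.3)], "C"),
]
print(accuracy_rows(outcomes))  # [25.0, 50.0, 75.0, 75.0, 75.0]
```

Under this convention the gap between the first two values reflects ties at the top rank, and the shortfall of the last value from 100 reflects items whose correct structure was never generated.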