

Analysis of Japanese Compound Nouns using Collocational Information

Tokunaga, Takenobu

Department of Computer Science, Tokyo Institute of Technology

1 Introduction

Analyzing compound nouns is one of the crucial issues for natural language processing systems, in particular for systems that aim at wide coverage of domains. Registering all compound nouns in dictionaries is obviously an impractical approach, since we can create a new compound noun by combining nouns. Therefore, a mechanism to infer the structure of a compound noun from its individual nouns is necessary.

In this paper, we propose a method to analyze structures of Japanese compound nouns by using collocational information of words and a thesaurus. The collocational information is acquired from a corpus of four kanzi character words. For each possible structure of a compound noun, the preference is calculated based on this collocational information.

2 Acquisition of collocational information

The procedure to acquire collocational information consists of the following four steps:

1. to collect four kanzi character words.
2. to divide the above words in the middle to obtain a pair of two kanzi character words. If one of them is not in the thesaurus, the four kanzi character word is discarded.
3. to assign thesaurus categories to each two kanzi character word.
4. to count occurrence frequencies of the category collocations.

3 Algorithm

The analysis consists of three steps:

1. to enumerate possible segmentations of an input compound noun by consulting headwords of the thesaurus.
2. to assign thesaurus categories to each word.
3. to calculate preferences of every structure of the compound noun according to the frequencies of category collocations.

We assume that the structure of a compound noun can be expressed by a binary tree. We also assume that the category of the right branch of a (sub)tree represents the category of the (sub)tree itself. This assumption follows from the fact that Japanese is a head-final language, namely, a modifier comes on the left of its modifiee. With these assumptions, the preference value of a structure is calculated by the following recursive function.

$$
p(t) =
\begin{cases}
1 & \text{if } t \text{ is leaf} \\
p(l(t)) \cdot p(r(t)) \cdot cv(cat(l(t)), cat(r(t))) & \text{otherwise}
\end{cases}
$$

where functions $l$ and $r$ return the left and right subtree of the tree respectively, and $cat$ returns the thesaurus categories of its argument. If the argument of $cat$ is a tree, $cat$ returns the category of the rightmost leaf of the tree. Function $cv$ returns an associativity measure of two categories, which is calculated from the frequency of category collocation described in the previous section. We use the following two measures for $cv$, where $P(cat_1, cat_2)$ is the relative frequency of the collocation with $cat_1$ on the left and $cat_2$ on the right, and $*$ matches any category.

Probability:

$$
cv_1 = P(cat_1, cat_2)
$$

Mutual information statistics (MIS):

$$
cv_2 = \frac{P(cat_1, cat_2)}{P(cat_1, *) \cdot P(*, cat_2)}
$$
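The acquisition steps of Section 2 and the analysis steps of Section 3 can be sketched together in a few lines of Python. This is only a minimal illustration under several assumptions that are not in the paper: the toy thesaurus and its category labels are invented (they are not BGH categories), each word has exactly one category, the recursion is taken to multiply the preferences of the two subtrees by the association value of their categories with a leaf scoring 1, and the mutual-information-style measure is taken to be the joint relative frequency divided by the product of the left and right marginal frequencies. All function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy thesaurus: word -> one category label (invented, not BGH).
THESAURUS = {
    "自然": "NATURE",
    "言語": "LANGUAGE",
    "処理": "ACTION",
    "情報": "INFORMATION",
}


def count_collocations(four_char_words, thesaurus):
    """Split each four-character word in the middle and count category pairs."""
    counts = Counter()
    for word in four_char_words:
        left, right = word[:2], word[2:]
        if len(word) != 4 or left not in thesaurus or right not in thesaurus:
            continue  # discard words whose halves are not both in the thesaurus
        counts[(thesaurus[left], thesaurus[right])] += 1
    return counts


def make_measures(counts):
    """Build the two association measures (assumed forms) from the counts."""
    total = sum(counts.values())
    left_totals, right_totals = Counter(), Counter()
    for (c1, c2), n in counts.items():
        left_totals[c1] += n
        right_totals[c2] += n

    def probability(c1, c2):
        return counts[(c1, c2)] / total if total else 0.0

    def mis(c1, c2):
        joint = counts[(c1, c2)]
        if joint == 0:
            return 0.0
        # (joint/total) / ((left/total) * (right/total)), simplified
        return joint * total / (left_totals[c1] * right_totals[c2])

    return probability, mis


def segmentations(text, thesaurus):
    """Enumerate all ways to split text into thesaurus headwords."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        if text[:i] in thesaurus:
            results += [[text[:i]] + rest
                        for rest in segmentations(text[i:], thesaurus)]
    return results


def trees(words):
    """Enumerate binary trees; a leaf is a str, a node is a (left, right) tuple."""
    if len(words) == 1:
        return [words[0]]
    return [(left, right)
            for i in range(1, len(words))
            for left in trees(words[:i])
            for right in trees(words[i:])]


def cat(tree, thesaurus):
    """Category of a tree is the category of its rightmost leaf."""
    while isinstance(tree, tuple):
        tree = tree[1]
    return thesaurus[tree]


def preference(tree, thesaurus, cv):
    """1 for a leaf, else the product of subtree preferences and cv (assumed form)."""
    if not isinstance(tree, tuple):
        return 1.0
    left, right = tree
    return (preference(left, thesaurus, cv) * preference(right, thesaurus, cv)
            * cv(cat(left, thesaurus), cat(right, thesaurus)))


corpus = ["自然言語", "言語処理", "情報処理", "言語処理"]  # toy four-character words
probability, mis = make_measures(count_collocations(corpus, THESAURUS))

candidates = [t for seg in segmentations("自然言語処理", THESAURUS)
              for t in trees(seg)]
ranked = sorted(candidates,
                key=lambda t: preference(t, THESAURUS, probability),
                reverse=True)
best = ranked[0]
```

With this toy corpus, the left-branching structure ((自然, 言語), 処理) is preferred over (自然, (言語, 処理)), because the category pair for 自然 followed by 処理 never occurs in the counted collocations. A real system would need a policy for words with several thesaurus categories, which this sketch does not address.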

This PostScript version was created from the original author's English article by the Japanese Information Sciences Project (JISP), at New York University, in collaboration with the RWCP, aiming at worldwide access to the information. Every precaution has been taken to avoid errors arising from the conversion of printed documents to electronic form; however, should there be any discrepancies, the JISP bears sole responsibility for them. Email address: www-admin@jisp.cs.nyu.edu, URL http://jisp.cs.nyu.edu/.


4 Experiments

We extract kanzi character sequences from editorials and columns of newspapers and texts of an encyclopedia.

710 compound nouns consisting of five kanzi characters and 789 compound nouns consisting of six kanzi characters are manually extracted from the set of the above kanzi character sequences. These two collections of compound nouns are used as test data.

As a thesaurus, we use Bunrui Goi Hyou (BGH). BGH is structured as a tree which has six levels of hierarchy. In this experiment, we use the categories at level 3.

Table 1 shows the result of the analysis for five and six kanzi character sequences. “” means that the correct answer is not obtained. The first row shows the percentage of cases in which the correct answer is uniquely identified, namely, with no tie. The remaining rows, denoted “”, show the percentage of cases in which the correct answer is included among the answers ranked at the given place or higher.
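The row definitions above can be illustrated with a small sketch. It assumes, without confirmation from the paper, that the rank of the correct structure is one plus the number of candidate structures with a strictly higher preference, and that a tie means another candidate has exactly the same preference. The function names and the toy scores are hypothetical and are not the paper's data.

```python
def rank_of_correct(scores, correct):
    """Return (rank, tied) of the correct structure, or (None, False) if absent.

    scores maps each candidate structure to its preference value.
    Rank is 1 + the number of candidates with a strictly higher preference
    (an assumed tie-handling rule).
    """
    if correct not in scores:
        return None, False  # the correct answer was not obtained
    target = scores[correct]
    higher = sum(1 for s in scores.values() if s > target)
    tied = sum(1 for s in scores.values() if s == target) > 1
    return higher + 1, tied


def accuracy_rows(results, max_rank=4):
    """Percentages for 'unique first' and 'within rank k' for k = 1..max_rank."""
    ranks = [rank_of_correct(scores, correct) for scores, correct in results]
    n = len(ranks)
    rows = {"unique 1": 100 * sum(1 for r, tied in ranks
                                  if r == 1 and not tied) / n}
    for k in range(1, max_rank + 1):
        rows[f"within {k}"] = 100 * sum(1 for r, _ in ranks
                                        if r is not None and r <= k) / n
    return rows


# Toy results: (preference of each candidate, name of the correct candidate).
toy_results = [
    ({"x": 0.5, "y": 0.2}, "x"),            # unique first place
    ({"x": 0.3, "y": 0.3}, "x"),            # first place, but tied
    ({"x": 0.1, "y": 0.4, "z": 0.2}, "x"),  # third place
    ({"x": 0.1, "y": 0.4}, "x"),            # second place
]
rows = accuracy_rows(toy_results)
```

Under these assumptions the "within 1" figure is never smaller than the "unique 1" figure, since it also counts first places shared through ties.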

Table 1: Accuracy of analysis [%].

6 kanzi

1 1 2 3 4

6

59 68 91 92 94

5

53 68 93 94 98

