let textEN = "The quick brown fox jumps over the lazy dog"
let textES = "El zorro marrón rápido salta sobre el perro perezoso"
let textAR = "الثعلب البني السريع يقفز فوق الكلب الكسول"
let textDE = "Der schnelle braune Fuchs springt über den faulen Hund"
I want to detect the language used in each of the declared strings.
Let's assume that the signature of the implemented function is:
func detectedLangauge<T: StringProtocol>(_ forString: T) -> String?
It returns an optional string, which is nil when no language is detected.
Thus the appropriate result would be:
let englishDetectedLangauge = detectedLangauge(textEN) // => English
let spanishDetectedLangauge = detectedLangauge(textES) // => Spanish
let arabicDetectedLangauge = detectedLangauge(textAR) // => Arabic
let germanDetectedLangauge = detectedLangauge(textDE) // => German
Is there an easy approach to achieve it?
Solution
As of iOS 11, you can achieve it by using NSLinguisticTagger. Implement the desired function as follows:
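import Foundation

func detectedLangauge<T: StringProtocol>(_ forString: T) -> String? {
    // Ask the tagger for the dominant language code of the text (for example "en").
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(forString)) else {
        return nil
    }

    // Convert the code into a display name localized for the current locale.
    let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)

    return detectedLangauge
}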
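This should achieve what you are asking for.
Describing the answer:
First of all, you should be aware that what you are asking about mainly belongs to the world of Natural Language Processing (NLP).
Since NLP covers far more than text language detection, the rest of the answer will not go into NLP specifics.
Obviously, implementing such functionality is not that easy, especially once you start caring about the details of the process, such as splitting the text into sentences and even into words, and then recognizing names and punctuation... I bet you would think "what a painful process! It isn't even reasonable to do it by myself". Fortunately, iOS does support NLP (actually, the NLP APIs are available on all Apple platforms, not only iOS), which makes what you are aiming for easy to implement. The core component that you will work with is NSLinguisticTagger: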
Analyze natural language text to tag part of speech and lexical class,
identify names, perform lemmatization, and determine the language and
script.
NSLinguisticTagger provides a uniform interface to a variety of
natural language processing functionality with support for many
different languages and scripts. You can use this class to segment
natural language text into paragraphs, sentences, or words, and tag
information about those segments, such as part of speech, lexical
class, lemma, script, and language.
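As mentioned in the class documentation, the method you are looking for, listed under the "Determining the Dominant Language and Orthography" section, is dominantLanguage(for:):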
Returns the dominant language for the specified string.
.
.
Return Value
The BCP-47 tag identifying the dominant language of the string, or the
tag "und" if a specific language cannot be determined.
You might notice that NSLinguisticTagger has existed since iOS 5. However, the dominantLanguage(for:) method is only supported on iOS 11 and later, because it was developed on top of the Core ML framework:
. . .
Core ML is the foundation for domain-specific frameworks and
functionality. Core ML supports Vision for image analysis, Foundation
for natural language processing (for example, the NSLinguisticTagger
class), and GameplayKit for evaluating learned decision trees. Core ML
itself builds on top of low-level primitives like Accelerate and BNNS,
as well as Metal Performance Shaders.
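Because of that availability constraint, a project that still deploys to older systems would need to guard the call. A minimal sketch, assuming a hypothetical helper named dominantLanguageIfAvailable(for:) (not part of Foundation) that simply returns nil on systems where the method does not exist:

```swift
import Foundation

// Hypothetical helper: returns the dominant language code (for example "en"),
// or nil when dominantLanguage(for:) is unavailable on the running system
// or when the tagger itself returns nil.
func dominantLanguageIfAvailable(for text: String) -> String? {
    if #available(iOS 11.0, macOS 10.13, tvOS 11.0, watchOS 4.0, *) {
        return NSLinguisticTagger.dominantLanguage(for: text)
    } else {
        // Older systems: language detection through this API is not possible.
        return nil
    }
}
```

If the deployment target is already iOS 11 or later, the check is unnecessary and the method can be called directly.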
As for the value returned by calling dominantLanguage(for:) with "The quick brown fox jumps over the lazy dog":
NSLinguisticTagger.dominantLanguage(for: "The quick brown fox jumps over the lazy dog")
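it would be the optional string "en". However, that is not yet the desired output: the expectation is to get "English" instead! Well, that is exactly what you get by calling the localizedString(forIdentifier:) method of the Locale structure and passing it the obtained language code (localizedString(forLanguageCode:) also works when you pass a plain language code):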
Locale.current.localizedString(forIdentifier: "en") // English
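Note that the name you get back is localized for the locale you call the method on, so Locale.current yields "English" only on a device set to English. A small sketch with fixed locales to make that visible; the constant names are only illustrative:

```swift
import Foundation

// Fixed locales, so the result does not depend on the device settings.
let englishLocale = Locale(identifier: "en_US")
let spanishLocale = Locale(identifier: "es_ES")

// The same language code rendered in two different display languages.
let germanInEnglish = englishLocale.localizedString(forIdentifier: "de")   // "German"
let germanInSpanish = spanishLocale.localizedString(forLanguageCode: "de") // "alemán"

print(germanInEnglish ?? "nil", germanInSpanish ?? "nil")
```

If the language name should always appear in one specific language regardless of user settings, passing an explicit Locale like this is one option; otherwise Locale.current is the natural choice.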
Putting it all together:
As mentioned in the quick-answer snippet at the top, the function would be:
func detectedLangauge<T: StringProtocol>(_ forString: T) -> String? {
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(forString)) else {
        return nil
    }

    let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)

    return detectedLangauge
}
Output:
It would be as expected:
let englishDetectedLangauge = detectedLangauge(textEN) // => English
let spanishDetectedLangauge = detectedLangauge(textES) // => Spanish
let arabicDetectedLangauge = detectedLangauge(textAR) // => Arabic
let germanDetectedLangauge = detectedLangauge(textDE) // => German
Note that:
There are still cases in which no language name can be obtained for the given string, such as:
let textUND = "SdsOE"
let undefinedDetectedLanguage = detectedLangauge(textUND) // => Unknown language
Or the result could even be nil:
let rabish = "000747322"
let rabishDetectedLanguage = detectedLangauge(rabish) // => nil
I still find that a reasonable result as far as providing useful output goes...
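If you would rather treat the undetermined case the same as nil, one option is to filter out the "und" tag before asking Locale for a name. A minimal sketch, assuming two hypothetical helpers (languageName(forCode:in:) and detectedLanguageOrNil(_:)) that are not part of the function above:

```swift
import Foundation

// Hypothetical helper: maps a language code to a display name,
// treating both nil and the undetermined tag "und" as "no language".
func languageName(forCode code: String?, in locale: Locale = .current) -> String? {
    guard let code = code, code != "und" else {
        return nil
    }
    return locale.localizedString(forIdentifier: code)
}

// Hypothetical variant of the detection function that never reports "und".
func detectedLanguageOrNil<T: StringProtocol>(_ text: T) -> String? {
    let code = NSLinguisticTagger.dominantLanguage(for: String(text))
    return languageName(forCode: code)
}
```

Whether "Unknown language" or nil is the better answer depends on the caller; separating the code filtering from the detection keeps that decision in one place.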
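Furthermore:
About NSLinguisticTagger:
Although I will not dive deep into the usage of NSLinguisticTagger, I would like to point out that it offers some really cool features beyond detecting the language of a given text. As a pretty simple example: when working on information retrieval, using the lemma while enumerating tags is very helpful, since it lets you recognize the word "driving" when the word "drive" is passed.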
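As a rough illustration of that idea, the sketch below enumerates the words of a sentence with the .lemma tag scheme. The helper name lemmas(in:) and the sample sentence are assumptions made for this example, and the lemmas actually returned depend on the system's language models, so no output is shown:

```swift
import Foundation

// Hypothetical helper: pairs each word of `text` with its lemma, if one is found.
// Uses enumerateTags(in:unit:scheme:options:using:), available from iOS 11.
func lemmas(in text: String) -> [(token: String, lemma: String?)] {
    let tagger = NSLinguisticTagger(tagSchemes: [.lemma], options: 0)
    tagger.string = text

    let range = NSRange(text.startIndex..., in: text)
    let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]
    var result: [(token: String, lemma: String?)] = []

    tagger.enumerateTags(in: range, unit: .word, scheme: .lemma, options: options) { tag, tokenRange, _ in
        // tokenRange is an NSRange, so extract the word through NSString.
        let token = (text as NSString).substring(with: tokenRange)
        result.append((token: token, lemma: tag?.rawValue))
    }
    return result
}

for entry in lemmas(in: "I was driving while she drives") {
    print(entry.token, "->", entry.lemma ?? "(no lemma)")
}
```

Comparing lemmas instead of raw tokens is what would let a search for "drive" also match "driving" or "drives".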
Official Resources
Apple Video Sessions:
> For more about Natural Language Processing and how NSLinguisticTagger works: Natural Language Processing and your Apps.
Also, for getting familiar with Core ML:
> Introducing Core ML
> Core ML in depth