Title: R Interface to 'Sudachi'
Description: Interface to 'Sudachi' <https://github.com/WorksApplications/sudachi.rs>, a Japanese morphological analyzer. This is a port of what is available in Python.
Authors: Shinya Uryu [aut, cre], Akiru Kato [aut]
Maintainer: Shinya Uryu <[email protected]>
License: Apache License (>= 2.0)
Version: 0.1.0.9007
Built: 2024-11-24 03:09:24 UTC
Source: https://github.com/uribo/sudachir
Create a list of tokens
as_tokens(tbl, type, pos = TRUE, ...)
tbl: A data.frame of tokens, as returned by tokenize_to_df().
type: Preference for the format of returned tokens. Pick one of "surface", "dictionary", "normalized", or "reading".
pos: When TRUE (the default), part-of-speech information is kept with each token.
...: Additional arguments passed on to internal methods.
A named list of character vectors.
## Not run:
tokenize_to_df("Tokyo, Japan") |>
  as_tokens(type = "surface")
## End(Not run)
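As a sketch of how the pos argument changes the output (assuming the Sudachi virtualenv is already set up; the exact token shapes depend on the installed dictionary):

```r
## Not run:
# Tokens with part-of-speech information kept (pos = TRUE, the default)
tokenize_to_df("東京都に行く") |>
  as_tokens(type = "normalized")

# Bare tokens only, without part-of-speech information
tokenize_to_df("東京都に行く") |>
  as_tokens(type = "normalized", pos = FALSE)
## End(Not run)
```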
Create a virtualenv environment used by sudachir
create_sudachipy_env(python_version = "3.9.12")
python_version: Python version to use within the created virtualenv environment. SudachiPy requires Python 3.6 or higher.
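For instance (a sketch; "3.9.12" is the default shown in the usage above, and the alternate version below is only illustrative):

```r
## Not run:
# Create the dedicated virtualenv with the default Python version
create_sudachipy_env()

# Or pin a specific interpreter version (3.6 or higher is required)
create_sudachipy_env(python_version = "3.10.4")
## End(Not run)
```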
Get dictionary's features
dict_features(lang = c("ja", "en"))
lang: Dictionary features label; one of "ja" or "en".
dict_features("en")
This function is a shorthand of tokenize_to_df() |> as_tokens().
form(x, text_field = "text", docid_field = "doc_id", instance = rebuild_tokenizer(), ...)
x: A data.frame-like object or a character vector to be tokenized.
text_field: Name of the column containing the texts to be tokenized.
docid_field: Name of the column containing the identifiers of the texts.
instance: A binding to an instance of <sudachipy.tokenizer.Tokenizer>.
...: Passed to as_tokens().
A named list of character vectors.
## Not run:
form("Tokyo, Japan", type = "surface")
## End(Not run)
Install SudachiPy into a virtualenv virtual environment.
As a one-time setup step, you can run install_sudachipy() to install all dependencies.
install_sudachipy()
install_sudachipy() requires Python and virtualenv to be installed. See https://www.python.org/getit/.
## Not run:
install_sudachipy()
## End(Not run)
Rebuild 'Sudachi' tokenizer
rebuild_tokenizer(mode = c("C", "B", "A"), dict_type = c("core", "small", "full"), config_path = NULL)
mode: Split mode; one of "A", "B", or "C".
dict_type: Dictionary type; one of "core", "small", or "full".
config_path: Absolute path to a 'Sudachi' configuration file.
Returns a binding to an instance of <sudachipy.tokenizer.Tokenizer>.
## Not run:
tokenizer <- rebuild_tokenizer()
tokenize_to_df("Tokyo, Japan", instance = tokenizer)
## End(Not run)
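Sudachi's split modes differ in granularity: mode "A" splits text into the shortest units and mode "C" into the longest units. As a sketch, assuming the Sudachi environment is installed, the modes can be compared by rebuilding the tokenizer with each setting:

```r
## Not run:
# Rebuild tokenizers with different split modes and compare their output.
tok_a <- rebuild_tokenizer(mode = "A")
tok_c <- rebuild_tokenizer(mode = "C")

# Mode "A" tends to yield shorter units than mode "C" for compound words.
tokenize_to_df("国家公務員", instance = tok_a)
tokenize_to_df("国家公務員", instance = tok_c)
## End(Not run)
```

Because rebuild_tokenizer() is the default for the instance argument, passing a prebuilt tokenizer like this also avoids rebuilding it on every call.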
Uninstalls SudachiPy by removing the virtualenv environment.
remove_sudachipy()
## Not run:
install_sudachipy()
remove_sudachipy()
## End(Not run)
The old tokenizer() function was removed.
tokenizer(...)
...: Not used.
In general, users should not directly touch the <sudachipy.tokenizer.Tokenizer> and its MorphemeList objects. If you must access those objects, use the return value of the rebuild_tokenizer() function.
Create a data.frame of tokens
tokenize_to_df(x, text_field = "text", docid_field = "doc_id", into = dict_features(), col_select = seq_along(into), instance = rebuild_tokenizer(), ...)
x: A data.frame-like object or a character vector to be tokenized.
text_field: Name of the column containing the texts to be tokenized.
docid_field: Name of the column containing the identifiers of the texts.
into: Column names of features.
col_select: Character or integer vector of the columns that are kept in the return value.
instance: A binding to an instance of <sudachipy.tokenizer.Tokenizer>.
...: Currently not used.
A tibble.
## Not run:
tokenize_to_df(
  "Tokyo, Japan",
  into = dict_features("en"),
  col_select = c("pos1", "pos2")
)
## End(Not run)