Title: | Tools, Measures and Statistical Tests for Cultural Evolution |
---|---|
Description: | Provides tools and statistics useful for analysing data from artificial language experiments. It implements the information-theoretic measure of the compositionality of signalling systems due to Spike (2016) <http://hdl.handle.net/1842/25930>, the Mantel test for distance matrix correlation (after Dietz 1983) <doi:10.1093/sysbio/32.1.21>, functions for computing string and meaning distance matrices, as well as an implementation of the Page test for monotonicity of ranks (Page 1963) <doi:10.1080/01621459.1963.10500843> with exact p-values up to k = 22. |
Authors: | Kevin Stadler [aut, cre] |
Maintainer: | Kevin Stadler <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.3 |
Built: | 2024-10-29 03:16:43 UTC |
Source: | https://github.com/kevinstadler/cultevo |
Transforms a meaning matrix to 'wide' format: instead of one column per
meaning dimension holding that dimension's value, every possible value of
every dimension is treated as its own categorical 'meaning feature', whose
presence or absence is represented by a logical TRUE/FALSE value in its own
meaning feature column.
binaryfeaturematrix(meanings, rownames = NULL)
meanings |
a matrix or data frame with meaning dimensions along columns
and different meaning combinations along rows (such as created by
|
rownames |
optional character vector of the same length as the number
of rows of |
Given a matrix or data frame with meaning dimensions along columns and different combinations of meaning feature values along rows, creates a matrix with the same number of rows but with one column for every possible value of every meaning dimension.
All meaning dimensions and values are treated categorically, i.e. as factors with no gradual notion of meaning feature similarity, neither within nor across the original meaning dimensions. Information about which feature values correspond to which meaning dimensions is essentially discarded in this representation, but could in principle be recovered through the patterns of (non)-co-occurrence of different meaning features.
In order for the resulting meaning columns to be interpretable, the column
names of the result are of the structure columnname=value, based on the
column names of the input meaning matrix (see Examples).
A matrix of TRUE/FALSE values with as many rows as meanings and one column
for every column-value combination in meanings.
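The conversion described above can be sketched in base R. This is an illustration only, not the package's implementation: the helper name `onehot_meanings` is hypothetical, and the ordering of the feature columns (alphabetical within each dimension) is an assumption that may differ from what binaryfeaturematrix() returns.

```r
# Hypothetical helper illustrating the 'wide' feature format.
# Assumption: feature columns are ordered alphabetically per dimension.
onehot_meanings <- function(meanings) {
  meanings <- as.data.frame(meanings, stringsAsFactors = FALSE)
  columns <- lapply(names(meanings), function(dimension) {
    x <- as.character(meanings[[dimension]])
    values <- sort(unique(x))
    # one logical column per attested value of this dimension
    m <- outer(x, values, "==")
    colnames(m) <- paste(dimension, values, sep = "=")
    m
  })
  do.call(cbind, columns)
}

meanings <- data.frame(shape = c("square", "square", "circle", "circle"),
                       color = c("red", "blue", "red", "blue"))
features <- onehot_meanings(meanings)
print(features)
```

Every row of the result has exactly one TRUE per original meaning dimension, which is the only trace of the dimension structure that survives in this representation.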
enumerate.meaningcombinations(c(2, 2))
binaryfeaturematrix(enumerate.meaningcombinations(c(2, 2)))
Checks or fixes the given distance matrix specification and returns an
equivalent, symmetric matrix object with 0s in the diagonal.
check.dist(x)
x |
an object (or list of objects) specifying a distance matrix |
If the argument is a matrix, checks whether it is a valid specification of a distance matrix and returns it, making it symmetric if it isn't already.
If the argument is a list, calls check.dist on each of its elements and
returns a list of the results.
For all other object types, attempts to coerce the argument to a dist
object and returns the corresponding distance matrix (see above).
a symmetric matrix object (or list of such objects) of the same dimension
as x
Count occurrences of all possible substrings in one or more strings.
count.substring.occurrences(strings, sortbylength = FALSE)
strings |
a list or vector of character sequences |
sortbylength |
logical indicating whether the substring columns should be ordered according to the (decreasing) length of the substrings. Default is to leave them in the original order in which they occur in the given strings. |
A matrix with the original strings along rows and all substrings of those strings along columns. The cell values indicate whether (and how many times) the substring is contained in each of the strings.
count.substring.occurrences(c("asd", "asdd", "foo"))
Enumerates all possible combinations of meanings for a meaning space of the given dimensionality.
enumerate.meaningcombinations(dimensionality, uniquelabels = TRUE, offset = 0)
dimensionality |
either a) a vector of integers specifying the number of different possible values for every meaning dimension, or b) a list or other (potentially ragged) 2-dimensional data structure listing the possible meaning values for every dimension |
uniquelabels |
logical, determines whether the same integers can be
reused across meaning dimensions or not. When |
offset |
a constant that is added to all meaning specifiers. Ignored
when |
The resulting matrix can be passed straight on to hammingdists and other
meaning distance functions created by wrap.meaningdistfunction.
A matrix that has as many columns as there are dimensions, with every row specifying one of the possible meaning combinations. The entries of the first dimension cycle slowest (see examples).
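The row ordering described above (first dimension cycling slowest) can be reproduced with base R's expand.grid(), which by itself varies its first factor fastest. The following is a minimal sketch under stated assumptions: `enumerate_sketch` is a hypothetical name, it only handles the integer-vector form of `dimensionality`, and it ignores the `uniquelabels` and `offset` arguments entirely.

```r
# Hypothetical sketch, not the package function.
# Assumption: 'dimensionality' is a vector of integers and the same
# integer labels 1..n are reused in every dimension.
enumerate_sketch <- function(dimensionality) {
  values <- lapply(dimensionality, seq_len)
  # expand.grid varies its first factor fastest, so reverse the
  # dimensions going in and the columns coming out
  grid <- rev(expand.grid(rev(values)))
  unname(as.matrix(grid))
}

combos <- enumerate_sketch(c(3, 4))
print(combos)
```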
enumerate.meaningcombinations(c(2, 2))
enumerate.meaningcombinations(c(3, 4))
enumerate.meaningcombinations(c(2, 2, 2, 2))
enumerate.meaningcombinations(8) # trivial
enumerate.meaningcombinations(list(shape=c("square", "circle"),
  color=c("red", "blue")))
Enumerate all substrings of a string.
enumerate.substrings(string)
string |
a character string |
a vector containing all substrings of the string (including duplicates)
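A string of length n has n(n+1)/2 substrings once duplicates are counted. As a rough illustration in base R (the helper name `all_substrings` is hypothetical, and the order of the returned substrings is an assumption that need not match enumerate.substrings()):

```r
# Hypothetical helper: every contiguous substring, duplicates included.
all_substrings <- function(string) {
  n <- nchar(string)
  # all (start, end) index pairs with start <= end
  idx <- which(upper.tri(diag(n), diag = TRUE), arr.ind = TRUE)
  substring(string, idx[, "row"], idx[, "col"])
}

subs <- all_substrings("abccc")
length(subs)   # 5 * 6 / 2 substrings
table(subs)    # "c" and "cc" occur more than once
```

Tabulating the result per string is essentially what a substring occurrence matrix such as the one returned by count.substring.occurrences() records.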
enumerate.substrings("abccc")
Returns a distance matrix giving all pairwise Hamming distances between the
rows of its argument meanings, which can be a matrix, data frame or vector.
Vectors are treated as matrices with a single column, so the distances in
its return value can only be 0 or 1.
hammingdists(meanings)
meanings |
a matrix with the different dimensions encoded along
columns, and all combinations of meanings specified along rows. The data
type of the cells does not matter since distance is simply based on
equality (with the exception of |
This function behaves differently from calling
dist(meanings, method="manhattan") in how NA values are treated: specifying
a meaning component as NA allows you to ignore that dimension for the given
row/meaning combination, instead of counting a difference between NA and
another value as a distance of 1.
A distance matrix of type dist with n*(n-1)/2 entries, where n is the number
of rows in meanings.
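The NA behaviour described above can be illustrated with a small base R sketch. It is not the package's implementation: `hamming_na` is a hypothetical name, and the double loop favours readability over speed.

```r
# Hypothetical sketch of a Hamming distance that skips NA components.
hamming_na <- function(meanings) {
  meanings <- as.matrix(meanings)
  n <- nrow(meanings)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # comparisons involving NA evaluate to NA and are dropped by
      # na.rm, so they do not count towards the distance
      d[i, j] <- sum(meanings[i, ] != meanings[j, ], na.rm = TRUE)
    }
  }
  as.dist(d)
}

ignoredimension <- matrix(c(0, 0, 0, 1, 1, NA), ncol = 2, byrow = TRUE)
d_na <- hamming_na(ignoredimension)
d_vec <- hamming_na(c(0, 0, 1, 1))
print(d_na)
print(d_vec)
```

In the first example the third row differs from the other two only in its first component, so both of its distances are 1 regardless of the NA.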
# a 2x2 design using strings
print(strings <- matrix(c("a1", "b1", "a1", "b2", "a2", "b1", "a2", "b2"),
  ncol=2, byrow=TRUE))
hammingdists(strings)

# a 2x3 design using integers
print(integers <- matrix(c(0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2),
  ncol=2, byrow=TRUE))
hammingdists(integers)

# a 3x2 design using factors (ncol is always the number of dimensions)
print(factors <- data.frame(colour=c("red", "red", "green", "blue"),
  animal=c("dog", "cat", "dog", "cat")))
hammingdists(factors)

# if some meaning dimension is not relevant for some combinations of
# meanings (e.g. optional arguments), specifying them as NA in the matrix
# will make them not be counted towards the hamming distance! in this
# example the value of the second dimension does not matter (and does not
# count towards the distance) when the first dimension has value '1'
print(ignoredimension <- matrix(c(0, 0, 0, 1, 1, NA), ncol=2, byrow=TRUE))
hammingdists(ignoredimension)

# trivial case of a vector: first and last two elements are identical,
# otherwise a difference of one
hammingdists(c(0, 0, 1, 1))
Perform correlation tests between pairs of distance matrices. The Mantel
test is different from classical correlation tests (such as those
implemented by cor.test) in that the null distribution (and significance
level) are obtained through randomisation: an empirical null distribution
for the given data set is generated by shuffling the locations (matrix rows
and columns) of one of the matrices.
mantel.test(x, y, ...)

## Default S3 method:
mantel.test(
  x,
  y,
  plot = FALSE,
  method = c("spearman", "kendall", "pearson"),
  trials = 9999,
  omitzerodistances = FALSE,
  ...
)

## S3 method for class 'formula'
mantel.test(
  x,
  y,
  groups = NULL,
  stringdistfun = utils::adist,
  meaningdistfun = hammingdists,
  ...
)

## S3 method for class 'list'
mantel.test(x, y, plot = FALSE, ...)

## S3 method for class 'mantel'
plot(x, xlab = "generation", ...)
x |
a formula, distance matrix, or list of distance matrices (see below) |
y |
a data frame, distance matrix, or list of distance matrices of the
same length as |
... |
further arguments which are passed on to the default method (in
particular |
plot |
logical: immediately produce a plot of the test results (default:
|
method |
correlation coefficient to be computed. Passed on to
|
trials |
integer: maximum number of random permutations to be computed (see Details). |
omitzerodistances |
logical: if |
groups |
when |
stringdistfun |
when |
meaningdistfun |
when |
xlab |
the x axis label used when plotting the result of several Mantel tests next to each other |
If the number of possible permutations of the matrices is reasonably close
to the number of permutations specified by the trials parameter, a
deterministic enumeration of all the permutations will be carried out
instead of random sampling: such a deterministic test will return an exact
p-value.
plot() called on a data frame of class mantel plots a visualisation of the
test results (in particular, the distribution of the permuted samples
against the veridical correlation coefficient). If the veridical correlation
coefficient is plotted in blue, it was higher than all other coefficients
generated by random permutations of the data. When the argument contains the
result of more than one Mantel test, a side-by-side boxplot visualisation
shows the mean and standard deviation of the randomised samples (see
examples). Additional parameters ... to plot() are passed on to
plot.default.
A data frame of class mantel, with one row per Mantel test carried out,
containing the following columns:
method: Character string: type of correlation coefficient used
statistic: The veridical correlation coefficient between the entries in the two distance matrices
rsample: A list of correlation coefficients calculated from the permutations of the input matrices
mean: Average correlation coefficient produced by the permutations
sd: Standard deviation of the sampled correlation coefficients
p.value: Empirical p-value computed from the Mantel test: let ngreater be the number of correlation coefficients in rsample greater than or equal to statistic, then p.value is (ngreater+1)/(length(rsample)+1) (North, Curtis and Sham 2002).
p.approx: The theoretical p-value that would correspond to the standard z score as calculated above.
is.unique.max: Logical, TRUE iff the veridical correlation coefficient is greater than any of the coefficients calculated for the permutations. If this is true, then p.value == 1 / (length(rsample)+1)
Multiple mantel objects can easily be combined by calling
rbind(test1, test2, ...).
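The randomisation procedure and the empirical p-value defined above can be sketched in a few lines of base R. This is a simplified illustration, not the package's code: `mantel_sketch` is a hypothetical name, it always samples permutations at random (no deterministic enumeration), and it has no equivalent of `omitzerodistances`.

```r
# Hypothetical sketch of a Mantel permutation test.
mantel_sketch <- function(d1, d2, trials = 999, method = "spearman") {
  m1 <- as.matrix(d1)
  m2 <- as.matrix(d2)
  lower <- lower.tri(m1)
  statistic <- cor(m1[lower], m2[lower], method = method)
  rsample <- replicate(trials, {
    perm <- sample.int(nrow(m1))
    # shuffle rows and columns of one matrix in the same order
    cor(m1[perm, perm][lower], m2[lower], method = method)
  })
  ngreater <- sum(rsample >= statistic)
  list(statistic = statistic,
       mean = mean(rsample),
       sd = sd(rsample),
       # empirical p-value after North, Curtis and Sham (2002)
       p.value = (ngreater + 1) / (length(rsample) + 1))
}

set.seed(1)
result <- mantel_sketch(dist(1:8), dist(1:8), trials = 199)
str(result)
```

Adding 1 to both numerator and denominator means the p-value can never be exactly 0, which reflects that the observed arrangement is itself one possible permutation.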
mantel.test(default): Perform Mantel correlation test on two distance
matrices. The distance matrices can either be of type dist, plain R matrices
or any object that can be interpreted by check.dist. The order of the two
matrices does not matter unless omitzerodistances = TRUE, in which case
cells with a 0 in the second matrix are omitted from the calculation of the
correlation coefficient. For consistency it is therefore recommended to
always pass the string distance matrix first, meaning distance matrix second.
mantel.test(formula): This function can be called with raw experimental
result data frames; distance matrix calculation is taken care of internally.
x is a formula of the type s ~ m1 + m2 + ... where s is the column name of
the character strings in data frame or matrix y, while m1 etc. are the
column names specifying the different meaning dimensions. To calculate the
respective distances, the function stringdistfun is applied to the strings,
meaningdistfun to the meaning columns.
mantel.test(list): When x is a list of distance matrices, and y is either a
single distance matrix or a list of distance matrices the same length as x:
runs a Mantel test for every pairwise combination of distance matrices in x
and y and returns a mantel object with as many rows.
Dietz, E. J. 1983. “Permutation Tests for Association Between Two Distance Matrices.” Systematic Biology 32 (1): 21–26. https://doi.org/10.1093/sysbio/32.1.21.
North, B. V., D. Curtis and P. C. Sham. 2002. “A Note on the Calculation of Empirical P Values from Monte Carlo Procedures.” The American Journal of Human Genetics 71 (2): 439–41. https://doi.org/10.1086/341527.
cor, adist, hammingdists, normalisedlevenshteindists, orderinsensitivedists
# small distance matrix, Mantel test run deterministically
mantel.test(dist(1:7), dist(1:7))

## Not run:
# run test on smallest distance matrix which requires a random
# permutation test, and plot it
plot(mantel.test(dist(1:8), dist(1:8), method="kendall"))
## End(Not run)

## Not run:
# 2x2x2x2 design
mantel.test(hammingdists(enumerate.meaningcombinations(c(2, 2, 2, 2))),
  dist(1:16), plot=TRUE)
## End(Not run)

# using the formula interface in combination with a data frame:
print(data <- cbind(word=c("aa", "ab", "ba", "bb"),
  enumerate.meaningcombinations(c(2, 2))))
mantel.test(word ~ Var1 + Var2, data)

## Not run:
# pass a list of distance matrices as the first argument, but just one
# distance matrix as the second argument: this runs separate tests on
# the pairwise combinations of the first and second argument
result <- mantel.test(list(dist(1:8), dist(sample(8:1)), dist(runif(8))),
  hammingdists(enumerate.meaningcombinations(c(2, 2, 2))))

# print the result of the three independently run permutation tests
print(result)

# show the three test results in one plot
plot(result, xlab="group")
## End(Not run)
Compute the normalised Levenshtein distances between strings.
normalisedlevenshteindists(strings)
strings |
a vector or list of strings |
A distance matrix specifying all pairwise normalised Levenshtein distances between the strings.
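The page does not spell out the normalisation. A common convention, assumed here and not confirmed by the source, is to divide the raw Levenshtein distance by the length of the longer of the two strings, which bounds the result between 0 and 1. A sketch using utils::adist (the helper name `normalised_lev` is hypothetical):

```r
# Hypothetical sketch. Assumption: normalise by the longer string.
normalised_lev <- function(strings) {
  strings <- as.character(strings)
  raw <- utils::adist(strings)
  longest <- outer(nchar(strings), nchar(strings), pmax)
  normalised <- raw / longest
  diag(normalised) <- 0  # guards against 0/0 for empty strings
  as.dist(normalised)
}

d_short <- normalised_lev(c("abc", "abd"))
d_long <- normalised_lev(c("abd", "absolute"))
print(d_short)
print(d_long)
```

Under this convention "abd" and "absolute" are 6 edits apart (one substitution and five insertions), giving 6/8.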
normalisedlevenshteindists(c("abd", "absolute", "asdasd", "casd"))
Calculate the bag-of-characters similarity between strings.
orderinsensitivedists(
  strings = NULL,
  split = NULL,
  segmentcounts = segment.counts(strings, split)
)
strings |
a vector or list of strings |
split |
boundary sequence at which to segment the strings (default splits the string into all its constituent characters) |
segmentcounts |
if custom segmentation is required, the pre-segmented strings can be passed as this argument (which is a list of lists) |
a distance matrix
orderinsensitivedists(c("xxxx", "asdf", "asd", "dsa"))
Given N replications of k different treatments/conditions, tests whether the
median ordinal ranks $m_i$ of the treatments are identical,
$$m_1 = m_2 = \ldots = m_k$$
against the alternative hypothesis
$$m_1 \leq m_2 \leq \ldots \leq m_k$$
where at least one of the inequalities is a strict inequality (Siegel and Castellan 1988, p.184). Given that even a single point change in the distribution of ranks across conditions represents evidence against the null hypothesis, the Page test is simply a test for some ordered differences in ranks, but not a 'trend test' in any meaningful way (see also the Page test tutorial).
page.test(data, verbose = TRUE)
page.L(data, verbose = TRUE, ties.method = "average")
page.compute.exact(k, N, L = NULL)
data |
a matrix with the different conditions along its |
verbose |
whether to print the final rankings based on which the L statistic is computed |
ties.method |
how to resolve tied ranks. Passed on to
|
k |
number of conditions/generations |
N |
number of replications/chains |
L |
value of the Page L statistic |
Tests the given matrix for monotonically increasing ranks across k linearly
ordered conditions (along columns) based on N replications (along rows). To
test for monotonically decreasing ranks, either reverse the order of
columns, or simply invert the rank ordering by negating the entire dataset
(i.e. calling - on it).
Exact p-values are computed for k up to 22, using the pre-computed null
distributions from the pspearman package. For larger k, p-values are
computed based on a Normal distribution approximation (Siegel and Castellan,
1988).
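For orientation, Page's L statistic and its textbook Normal approximation can be computed directly in base R. The sketch below uses the standard definitions (L is the sum over conditions of the condition index times the column rank sum, with mean Nk(k+1)^2/4 and variance Nk^2(k+1)(k^2-1)/144 under the null); the helper name `page_L_sketch` is hypothetical, and the package may differ in details such as continuity corrections.

```r
# Hypothetical sketch of Page's L with a Normal approximation.
page_L_sketch <- function(data, ties.method = "average") {
  # rank the k conditions within every replication (row)
  ranks <- t(apply(data, 1, rank, ties.method = ties.method))
  k <- ncol(ranks)
  N <- nrow(ranks)
  L <- sum(seq_len(k) * colSums(ranks))
  mu <- N * k * (k + 1)^2 / 4
  sigma2 <- N * k^2 * (k + 1) * (k^2 - 1) / 144
  z <- (L - mu) / sqrt(sigma2)
  c(L = L, z = z, p.approx = pnorm(z, lower.tail = FALSE))
}

# three replications, each perfectly increasing across four conditions
perfect <- matrix(rep(1:4, each = 3), nrow = 3)
res <- page_L_sketch(perfect)
print(res)
```

For small k and N the Normal approximation is rough, which is why exact null distributions are preferable where available.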
page.test returns a list of class pagetest (and htest) containing the
following elements:
statistic: value of the L statistic for the data set
parameter: a named vector specifying the number of conditions (k) and replications (N) of the data (which is the number of columns and rows of the data set, respectively)
p.value: significance level
p.type: whether the computed p-value is "exact" or "approximate"
page.test(): See above.
page.L(): Calculate Page's L statistic for the given dataset.
page.compute.exact(): Calculate exact significance levels of the Page L
statistic. Returns a single numeric indicating the null probability of the
Page statistic with the given k, N being greater than or equal to the given
L.
Siegel, S., and N. J. Castellan, Jr. (1988). Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill.
# exact p value computation for N=4, k=4
page.test(t(replicate(4, sample(4))))

# exact p value computation for N=4, k=10
page.test(t(replicate(4, sample(10))))

# approximate p value computation for N=4, k=23
result <- page.test(t(replicate(4, sample(23))), verbose = FALSE)
print(result)

# raw calculation of the significance levels
page.compute.exact(6, 4, 322)
Read a distance matrix from a file or data frame.
read.dist(data, el1.column = 1, el2.column = 2, dist.columns = 3)
data |
a filename, data frame or matrix |
el1.column |
the column name or id specifying the first element |
el2.column |
the column name or id specifying the second element |
dist.columns |
the column name(s) or id(s) specifying the distance(s) between the two corresponding elements |
a distance matrix (or list of distance matrices when there is more than one
dist.columns) of type matrix
read.dist(cbind(c(1,1,1,2,2,3), c(2,3,4,3,4,4), 1:6, 6:1), dist.columns=c(3,4))
Returns a new matrix, where the entries of the original matrix are repeated along both dimensions.
repmatrix(
  x,
  times = 1,
  each = 1,
  times.row = times,
  times.col = times,
  each.row = each,
  each.col = each,
  ...
)
x |
a matrix |
times |
how often the matrix should be replicated next to itself |
each |
how often individual cells should be replicated next to themselves |
times.row |
number of vertical repetitions of the matrix, overrides |
times.col |
number of horizontal repetitions of the matrix, overrides |
each.row |
number of vertical repetitions of individual elements, overrides |
each.col |
number of horizontal repetitions of individual elements, overrides |
... |
not used |
A matrix, which will have times*each times more rows and columns than the
original matrix.
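The effect of `times` and `each` can be imitated with plain index vectors. The sketch below is an assumption-laden illustration: `repmatrix_sketch` is a hypothetical name, it only covers the symmetric case (no separate row/column arguments), and it assumes `each` is applied before `times`, as base R's rep() does.

```r
# Hypothetical sketch: repeat cells ('each') and the whole matrix
# ('times') along both dimensions via index vectors.
repmatrix_sketch <- function(x, times = 1, each = 1) {
  rows <- rep(rep(seq_len(nrow(x)), each = each), times = times)
  cols <- rep(rep(seq_len(ncol(x)), each = each), times = times)
  x[rows, cols, drop = FALSE]
}

tiled <- repmatrix_sketch(diag(2), times = 2)    # 2x2 tiling of the matrix
stretched <- repmatrix_sketch(diag(2), each = 2) # every cell becomes a 2x2 block
both <- repmatrix_sketch(diag(3), times = 2, each = 2)
print(tiled)
print(stretched)
dim(both)
```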
repmatrix(diag(4))
repmatrix(diag(4), times=2)
repmatrix(diag(4), each=2)
repmatrix(diag(3), times=2, each=2)
repmatrix(diag(4), each.row=2)
repmatrix(diag(4), times.row=2)
Split strings into their constituent segments (and count them).
segment.string(x, split = NULL)
segment.counts(x, split = NULL)
x |
one or more strings to be split (and, optionally, counted) |
split |
the boundary character or sequence at which to segment the
string(s). The default, |
segment.string(): Returns a list (of the same length as x), each item a
character vector holding the segments of the corresponding string.
segment.counts(): Calculate the frequency of individual characters in one or more strings.
Returns a matrix with one row for every string in x.
segment.string(c("asd", "fghj"))
segment.string(c("la-dee-da", "lala-la"), "-")
segment.counts(c("asd", "aasd", "asdf"))
Returns the given matrix with rows and columns permuted in the same order.
shuffle.locations(m, perm = sample.int(dim(m)[1]))
m |
a matrix with an equal number of rows and columns |
perm |
vector of indices specifying the new order of rows/columns |
a matrix of the same size as m
Implementation of the Spike-Montague segmentation and measure of additive compositionality (Spike 2016), which finds the most predictive associations between meaning features and substrings. Computation is deterministic and fast.
sm.compositionality(x, y, groups = NULL, strict = FALSE)
sm.segmentation(x, y, strict = FALSE)
x |
a list or vector of character sequences specifying the signals to
be analysed. Alternatively, |
y |
a matrix or data frame with as many rows as there are signals,
indicating the presence/value of the different meaning dimensions along
columns (see section Meaning data format). If |
groups |
a list or vector with as many items as strings, used to split
|
strict |
logical: if |
The algorithm works on compositional meanings that can be expressed as sets of categorical meaning features (see below), and does not take the order of elements into account. Rather than looking directly at how complex meanings are expressed, the measure really captures the degree to which a homonymy- and synonymy-free signalling system exists at the level of individual semantic features.
The segmentation algorithm provided by sm.segmentation() scans through all
sub-strings found in strings to find the pairings of meaning features and
sub-strings whose respective presence is most predictive of each other.
Mathematically, for every meaning feature $f \in M$, it finds the sub-string
$s_{ij}$ from the set of strings $S$ that yields the highest mutual
predictability across all signals,
$$mp(f, S) = \max_{s_{ij} \in S} P(f|s_{ij}) \cdot P(s_{ij}|f).$$
Based on the mutual predictability levels obtained for the individual
meaning features, sm.compositionality then computes the mean mutual
predictability weighted by the individual features' relative frequencies of
attestation, i.e.
$$mp(M, S) = \sum_{f \in M} freq_f \cdot mp(f, S),$$
as a measure of the overall compositionality of the signalling system.
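To make the definition concrete, the mutual predictability of a single meaning feature can be computed by brute force in base R. This is only a sketch of the formula, not of the package's algorithm: the function name `mutual_predictability`, the toy signals and the feature vector are all hypothetical, and both conditional probabilities are estimated as simple relative frequencies over signals.

```r
# Hypothetical sketch: P(f|s) * P(s|f) for every attested substring s,
# returning the best-scoring substring(s) for one meaning feature.
mutual_predictability <- function(signals, feature) {
  # candidate segments: every substring attested in any signal
  candidates <- unique(unlist(lapply(signals, function(sig) {
    n <- nchar(sig)
    idx <- which(upper.tri(diag(n), diag = TRUE), arr.ind = TRUE)
    substring(sig, idx[, "row"], idx[, "col"])
  })))
  mp <- sapply(candidates, function(s) {
    contains <- grepl(s, signals, fixed = TRUE)
    cooc <- sum(contains & feature)
    (cooc / sum(contains)) * (cooc / sum(feature))
  })
  mp[mp == max(mp)]
}

signals <- c("kp", "km", "tp", "tm")
is_square <- c(TRUE, TRUE, FALSE, FALSE)  # toy feature: first two signals
best <- mutual_predictability(signals, is_square)
print(best)
```

Here "k" occurs in exactly the signals that carry the feature, so both conditional probabilities are 1. A substring such as "kp" predicts the feature perfectly but is only predicted by it half of the time, giving 0.5.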
Since mutual predictability is determined separately for every meaning
feature, the most predictive sub-strings posited for different meaning
features as returned by sm.segmentation() can overlap, and even coincide
completely. Such results are generally indicative of either limited data
(in particular frequent co-occurrence of the meaning features in question),
or spurious results in the absence of a consistent signalling system. The
latter will also be indicated by the significance level of the given mutual
predictability.
sm.segmentation provides detailed information about the most predictably
co-occurring segments for every meaning feature. It returns a data frame
with one row for every meaning feature, in descending order of the mutual
predictability from (and to) their corresponding string segments. The data
frame has the following columns:
N: The number of signal-meaning pairings in which this meaning feature was attested.
mp: The highest mutual predictability between this meaning feature and one (or more) segments that was found.
p: Significance levels of the given mutual predictability, i.e. the probability that the given mutual predictability level could be reached by chance. The calculation depends on the frequency of the meaning feature as well as the number and relative frequency of all substrings across all signals (see below).
ties: The number of substrings found in strings which have this same level of mutual predictability with the meaning feature.
segments: For strict=FALSE: a list containing the ties substrings in descending order of their length (the ordering is for convenience only and not inherently meaningful). When strict=TRUE, the lists of segments for each meaning feature are all of the same length, with a meaningful relationship of the order of segments across the different rows: every set of segments which are found in the same position for each of the different meaning features constitutes a valid segmentation where the segments' occurrences in the actual signals do not overlap.
sm.compositionality calculates the weighted average of the mutual
predictability of all meaning features and their most predictably
co-occurring strings, as computed by sm.segmentation. The function returns
a data frame of three columns: N is the total number of signals (utterances)
on which the computation was based, M the number of distinct meaning
features attested across all signals, and meanmp the mean mutual
predictability across all these features, weighted by the features' relative
frequency. When groups is not NULL, the data frame contains one row for
every group.
A perfectly unambiguous mapping between a meaning feature and a specific
string segment will always yield a mutual predictability of 1. In the
absence of such a regular mapping, on the other hand, chance co-occurrences
of strings and meanings will in most cases stop the mutual predictability
from going all the way down to 0. In order to help distinguish chance
co-occurrence levels from significant signal-meaning associations,
sm.segmentation() provides significance levels for the mutual predictability
levels obtained for each meaning feature.
What is the baseline level of association between a meaning feature and a set of sub-strings that we would expect to be due to chance co-occurrences? This depends on several factors, from the number of data points on which the analysis is based to the frequency of the meaning feature in question and, perhaps most importantly, the overall makeup of the different substrings that are present in the signals. Since every substring attested in the data is a candidate for signalling the presence of a meaning feature, the absolute number of different substrings greatly affects the likelihood of chance signal-meaning associations. (Diversity of the set of substrings is in turn heavily influenced by the size of the underlying alphabet, a factor which is often not appreciated.)
For every candidate substring, the degree of association with a specific meaning feature that we would expect by chance is again dependent on the absolute number of signals in which the substring is attested.
Starting from the simplest case, take a meaning that is featured in $m$ of
the total $n$ signals (where $0 < m \leq n$). Assume next that there is a
string segment that is attested in $s$ of these signals (where again
$0 < s \leq n$). The degree of association between the meaning feature and
string segment is dependent on the number of times that they co-occur, which
can be no more than $\min(s, m)$ times.
The null probability of getting a given number of co-occurrences can be
obtained by considering all possible reshufflings of the meaning feature in
question across all signals: if $s$ signals contain a given substring, how
many of $s$ randomly drawn signals from the pool of $n$ signals would
contain the meaning feature if a total of $m$ signals in the pool did?
Approached from this angle, the likelihood of the number of co-occurrences
follows the hypergeometric distribution, with $c$ being the number of
successes when taking $s$ draws without replacement from a population of
size $n$ with fixed number of successes $m$.
For every number of co-occurrences $c \in [0, \min(s, m)]$, one can compute
the corresponding mutual predictability level as
$\frac{c}{s} \cdot \frac{c}{m}$ to obtain the null distribution of mutual
predictability levels between a meaning feature and one substring of a
particular frequency $s$:
$$Pr\left(mp = \frac{c}{s} \cdot \frac{c}{m}\right) = \frac{\binom{m}{c} \binom{n-m}{s-c}}{\binom{n}{s}}$$
From this, we can now derive the null distribution for the entire set of
attested substrings as follows: making the simplifying assumption that the
occurrences of different substrings are independent of each other, we first
aggregate over the null distributions of all the individual substrings to
obtain the mean probability $Pr(X \geq mp)$ of finding a given mutual
predictability level at least as high as $mp$ for one randomly drawn string
from the entire population of substrings. Assuming the total number of
candidate substrings is $|S|$, the overall null probability that at least
one of them would yield a mutual predictability at least as high is
$$p = 1 - \left(1 - Pr(X \geq mp)\right)^{|S|}$$
Note that, since the null distribution also depends on the frequency with which the meaning feature is attested, the significance levels corresponding to a given mutual predictability level are not necessarily identical for all meaning features, even within one analysis.
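The derivation above can be sketched in a few lines of base R using `dhyper()`. This is a minimal illustration under assumed toy values, not the package's implementation: `n` is the total number of signals, `m` the number of signals carrying the meaning feature, `substring.freqs` an invented vector giving the number of signals in which each candidate substring occurs, and `p.at.least()` a hypothetical helper name.

```r
# Hypothetical helper: null probability that one substring attested in s
# signals reaches a mutual predictability of at least observed.mp with a
# meaning feature attested in m of the n signals.
p.at.least <- function(s, m, n, observed.mp) {
  co <- 0:min(s, m)                  # possible numbers of co-occurrences
  mp <- (co / s) * (co / m)          # corresponding mutual predictability
  # hypergeometric: s draws from n signals, m of which carry the feature
  sum(dhyper(co, m, n - m, s)[mp >= observed.mp])
}

# assumed toy values
n <- 10
m <- 4
substring.freqs <- c(1, 2, 3, 3, 5)
observed.mp <- 0.75

# mean probability for one randomly drawn candidate substring
mean.p <- mean(sapply(substring.freqs, p.at.least,
                      m = m, n = n, observed.mp = observed.mp))

# probability that at least one of the candidates reaches that level,
# treating the candidates as independent
p.overall <- 1 - (1 - mean.p)^length(substring.freqs)
print(c(mean.p = mean.p, p.overall = p.overall))
```

Changing `m` while keeping everything else fixed changes the result, which is why the same mutual predictability level can correspond to different significance levels for different meaning features.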
(In theory, one can also compute an overall p-value of the weighted mean mutual predictability as calculated by sm.compositionality. However, the significance levels for the individual meaning features are much more insightful and should therefore be consulted directly.)
The meanings argument can be a matrix or data frame in one of two formats. If it is a matrix of logicals (TRUE/FALSE values), then the columns are assumed to refer to meaning features, with individual cells indicating whether the meaning feature is present or absent in the signal represented by that row (see binaryfeaturematrix() for an explanation). If meanings is a data frame or matrix of any other type, it is assumed that the columns specify different meaning dimensions, with the cell values showing the levels with which the different dimensions can be realised. This dimension-based representation is automatically converted to a feature-based one by calling binaryfeaturematrix(). As a consequence, whatever the actual types of the columns in the meaning matrix, they will be treated as categorical factors for the purpose of this algorithm, also discarding any explicit knowledge of which 'meaning dimension' they might belong to.
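The difference between the two formats can be illustrated with a small base R sketch. The helper `to.features()` and its `dimension=value` column naming are assumptions made for this illustration; the sketch is not the package's binaryfeaturematrix() implementation and its output layout may differ from what that function produces.

```r
# Hypothetical helper: turn a dimension-based meaning table into a logical
# feature matrix with one column per (dimension, value) combination.
to.features <- function(meanings) {
  meanings <- as.data.frame(meanings)
  features <- list()
  for (dimension in names(meanings)) {
    # all values are treated categorically, whatever their original type
    column <- as.character(meanings[[dimension]])
    for (value in unique(column)) {
      features[[paste(dimension, value, sep = "=")]] <- column == value
    }
  }
  do.call(cbind, features)
}

# dimension-based representation: one column per meaning dimension
meanings <- data.frame(animal = c("dog", "dog", "cat", "cat"),
                       colour = c("green", "blue", "green", "blue"))

# feature-based representation: one logical column per meaning feature
features <- to.features(meanings)
print(features)
```

Every row of the resulting matrix has exactly one TRUE cell per original dimension, but the matrix itself no longer records which feature columns belonged to the same dimension.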
Spike, M. 2016 Minimal requirements for the cultural evolution of language. PhD thesis, The University of Edinburgh. http://hdl.handle.net/1842/25930.
binaryfeaturematrix(), ssm.compositionality()
```r
# perfect communication system for two meaning features (which are marked
# as either present or absent)
sm.compositionality(c("a", "b", "ab"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
sm.segmentation(c("a", "b", "ab"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))

# not quite perfect communication system
sm.compositionality(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
sm.segmentation(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))

# same communication system, but force candidate segments to be non-overlapping
# via the 'strict' option
sm.segmentation(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)), strict=TRUE)

# the function also accepts meaning-dimension based matrix definitions:
print(twobytwoanimals <- enumerate.meaningcombinations(c(animal=2, colour=2)))

# note how there are many more candidate segments than just the full length
# ones. the less data we have, the more likely it is that shorter substrings
# will be just as predictable as the full segments that contain them.
sm.segmentation(c("greendog", "bluedog", "greencat", "bluecat"), twobytwoanimals)

# perform the same analysis, but using the formula interface
print(twobytwosignalingsystem <- cbind(twobytwoanimals,
  signal=c("greendog", "bluedog", "greencat", "bluecat")))
sm.segmentation(signal ~ colour + animal, twobytwosignalingsystem)

# since there is no overlap in the constituent characters of the identified
# 'morphemes', they are all tied in their mutual predictiveness with the
# (shorter) substrings they contain
#
# to reduce the pool of candidate segments to those which are
# non-overlapping and of maximal length, again use the 'strict=TRUE' option:
sm.segmentation(signal ~ colour + animal, twobytwosignalingsystem, strict=TRUE)
```
This algorithm builds on Spike's measure of compositionality (see sm.compositionality), except instead of simply determining which segment(s) have the highest mutual predictability for each meaning feature separately, it attempts to find a combination of non-overlapping segments for each feature that maximises the overall string coverage over all signals. In other words, it tries to find a segmentation which can account for (or 'explain') as much of the string material in the signals as possible.
ssm.compositionality(x, y, groups = NULL)
ssm.segmentation(x, y, mergefeatures = FALSE, verbose = FALSE)
x |
a list or vector of character sequences |
y |
a matrix or data frame with as many rows as there are strings (see section Meaning data format) |
groups |
a list or vector with as many items as strings, used to split the signals and meanings into data sets for which the compositionality measures are computed separately. |
mergefeatures |
logical: if |
verbose |
logical: if |
For large data sets and long strings, this computation can get very slow. If the attested signals are such that no perfect segmentation is possible, this algorithm is not guaranteed to find any segmentation (as no such segmentation might exist).
```r
ssm.segmentation(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))

# signaling system where one meaning distinction is not encoded in the signals
print(threebytwoanimals <- enumerate.meaningcombinations(
  list(animal=c("dog", "cat", "tiger"), colour=c("col1", "col2"))))
ssm.segmentation(c("greendog", "bluedog", "greenfeline", "bluefeline",
  "greenfeline", "bluefeline"), threebytwoanimals)

# the same analysis again, but allow merging of features
ssm.segmentation(c("greendog", "bluedog", "greenfeline", "bluefeline",
  "greenfeline", "bluefeline"), threebytwoanimals, mergefeatures=TRUE)
```
Create a vector of 'temperature' colors (from blue over white to red).
temperature.colors(mn, mx = NULL, intensity = 1)
mn |
integer: when |
mx |
integer: 'warmest' temperature (see examples) |
intensity |
saturation of the most extreme color(s), in the range |
```r
# full intensity
image(as.matrix(1:7), z=as.matrix(1:7), col=temperature.colors(7))

# half intensity
image(as.matrix(1:7), z=as.matrix(1:7), col=temperature.colors(7, intensity=0.5))

# skewed palette with more negative than positive temperature colors
image(as.matrix(1:7), z=as.matrix(1:7), col=temperature.colors(-4, 2))
```
This function takes as its only argument a function f(m1, m2) which returns a single numeric indicating the distance between two 'meanings' m1, m2 (which are themselves most likely vectors or lists). Based on f, this function returns a function g(mm) which takes as its only argument a matrix or data frame mm with the meaning elements (equivalent to the ones in m1, m2) along columns and different meaning combinations (like m1, m2, ...) along rows. The function g returns a distance matrix of class dist containing all pairwise distances between the rows of mm. The resulting function g can be passed to other functions in this package, in particular mantel.test.
wrap.meaningdistfunction(pairwisemeaningdistfun)
pairwisemeaningdistfun |
a function of two arguments returning a single numeric indicating the semantic distance between its arguments |
The meaning distance function should be commutative, i.e. f(a,b) = f(b,a), and meanings should have a distance of zero to themselves, i.e. f(a,a) = 0.
A function that takes a meaning matrix and returns a corresponding distance matrix of class dist.
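To make the idea of such a wrapper concrete, here is a minimal base R sketch of how a pairwise distance function can be lifted to a function that returns a dist object. The names `wrap.pairwise()` and `hamming()` are hypothetical and introduced for illustration only; the sketch shows the general technique and is not the package's actual implementation.

```r
# Hypothetical re-implementation for illustration: takes a pairwise distance
# function f(m1, m2) and returns a function g(mm) that computes all pairwise
# distances between the rows of mm.
wrap.pairwise <- function(f) {
  function(mm) {
    mm <- as.matrix(mm)
    n <- nrow(mm)
    d <- matrix(0, nrow = n, ncol = n)
    for (i in seq_len(n)) {
      for (j in seq_len(n)) {
        # f is assumed to be commutative, so every pair is computed once
        if (i < j) d[i, j] <- d[j, i] <- f(mm[i, ], mm[j, ])
      }
    }
    as.dist(d)
  }
}

# example pairwise distance: number of meaning dimensions that differ
hamming <- function(m1, m2) sum(m1 != m2)

g <- wrap.pairwise(hamming)
meanings <- rbind(c(1, 1), c(1, 2), c(2, 2))
distances <- g(meanings)
print(distances)
```

Because only the pairs with i < j are evaluated, the diagonal stays at zero and the result is symmetric by construction, which matches the two requirements stated for the meaning distance function.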
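```r
# trivial distance between two numeric meanings; abs() makes sure that the
# function is commutative, as required above
trivialdistance <- function(a, b) return(abs(a - b))

trivialmeanings <- as.matrix(3:1)
trivialdistance(trivialmeanings[1], trivialmeanings[2])
trivialdistance(trivialmeanings[1], trivialmeanings[3])
trivialdistance(trivialmeanings[2], trivialmeanings[3])

# wrap the pairwise function to obtain a distance matrix function
distmatrixfunction <- wrap.meaningdistfunction(trivialdistance)
distmatrixfunction(trivialmeanings)
```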