Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\).
Let \(p_w(n)\) be the cardinality of \(P_w(n)\).
Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
\(|P_{f(5DKZ_1)}(2) \setminus P_{f(8TOU_1)}(2)|=70\),
\(|P_{f(8TOU_1)}(2) \setminus P_{f(5DKZ_1)}(2)|=47\).
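The subword sets and the one-sided counts above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this page: the function names are hypothetical, and the toy words `abab` and `abba` stand in for the actual PDB sequences, which are not reproduced here.

```python
def subwords(w: str, n: int) -> set[str]:
    """Return P_w(n): the set of distinct length-n subwords (intervals) of w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def subword_count(w: str, n: int) -> int:
    """Return p_w(n), the cardinality of P_w(n)."""
    return len(subwords(w, n))


def one_sided_difference(x: str, y: str, k: int) -> int:
    """Return |P_x(k) \\ P_y(k)|: length-k subwords of x that do not occur in y."""
    return len(subwords(x, k) - subwords(y, k))


if __name__ == "__main__":
    x, y = "abab", "abba"  # toy words (assumption), not the sequences in the text
    print(sorted(subwords(x, 2)))         # ['ab', 'ba']
    print(sorted(subwords(y, 2)))         # ['ab', 'ba', 'bb']
    print(one_sided_difference(x, y, 2))  # 0
    print(one_sided_difference(y, x, 2))  # 1
```

Applied to \(f(5DKZ_1)\) and \(f(8TOU_1)\) with \(k=2\), the two calls to `one_sided_difference` would correspond to the counts 70 and 47 reported above.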
Let
\(
Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)|
\)
be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\). For the pair above, \(Z_2(f(5DKZ_1),f(8TOU_1))=70+47=117\).

Hydrophobic-polar version of Sequence 1:
100110000100000011000001010011011010010011100101001000111100100010010111010110010101010000000101010000010000000100111111101001101000000010100110001010101111010100011001010001110100101010111010000100010000011000000110001110000010110011101011100011111001011010000110100001001001011001100110101011110100000011111101100110100100000111111000100000110001110111111101001100010101001110010110000010010000100100010010110011110100000000101000010011111001010100110110101000000111001000011100001011011011100011011011100110111000010101000111011001011011010110001001010000100101101001000111000110000111100111110001111101000101101010111110011111111110111111010001100100011101110101010100001010101000111111010001110100110010101011101110001000111110001111001111011000000010111100010000100011011000010101110011111011011100011000011101010011111100001010101001001000010010001110100100100010001010010110010010100111111111101000101000100111110001100001111110011101110101111
Let \(d\) be the Otu--Sayood distance \(d\).
Let \(d_1\) be the Otu--Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
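For orientation, here is a hedged Python sketch of the two distances. It assumes the unnormalized definitions usually attributed to Otu and Sayood, \(d(S,Q)=\max\{c(SQ)-c(S),\,c(QS)-c(Q)\}\) and \(d_1(S,Q)=(c(SQ)-c(S))+(c(QS)-c(Q))\), where \(c\) counts the components of the LZ76 exhaustive history. Neither the definitions nor the parsing routine are taken from this page, and the function names are hypothetical.

```python
def lz76_complexity(s: str) -> int:
    """Count components of the LZ76 exhaustive history of s (quadratic-time sketch)."""
    i, components, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the component while it can be copied from earlier text;
        # the source of the copy may overlap the component itself.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        components += 1
        i += length
    return components


def otu_sayood(s: str, q: str) -> tuple[int, int]:
    """Return (d, d1) under the assumed unnormalized Otu--Sayood definitions."""
    gain_sq = lz76_complexity(s + q) - lz76_complexity(s)  # new components q adds after s
    gain_qs = lz76_complexity(q + s) - lz76_complexity(q)  # new components s adds after q
    return max(gain_sq, gain_qs), gain_sq + gain_qs


if __name__ == "__main__":
    # Parses as 0|001|10|100|1000|101, i.e. six components.
    print(lz76_complexity("0001101001000101"))  # 6
    print(otu_sayood("abab", "abab"))           # (0, 0)
    print(otu_sayood("aaaa", "bbbb"))           # (1, 2)
```

Under these assumed definitions a highly repetitive sequence has very low complexity, which is one way to see why a sequence such as AAAAAA can end up close to many others.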
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{1576}{\log_{20}1576}-\frac{625}{\log_{20}625}\right)\approx 238.\)
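The arithmetic in this estimate can be checked with a short Python sketch. The constants 0.85, 0.8, 1576, 625 and the base 20 are copied from the expression above without further interpretation; the function names are hypothetical.

```python
import math


def n_over_log(n: int, base: int = 20) -> float:
    """Return n / log_base(n), the term that appears twice in the estimate."""
    return n / math.log(n, base)


def rough_expected_distance(n_long: int = 1576, n_short: int = 625,
                            factors: tuple[float, ...] = (0.85, 0.8)) -> float:
    """Evaluate (0.85)(0.8)(1576/log_20 1576 - 625/log_20 625) with the text's constants."""
    scale = math.prod(factors)
    return scale * (n_over_log(n_long) - n_over_log(n_short))


if __name__ == "__main__":
    print(round(rough_expected_distance()))  # 238
```

The unrounded value is about 238.3, so the figure 238 is a rounding.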
Status            Protein1   Protein2   d     d1/2
Query variables   5DKZ_1     8TOU_1     306   252.5
In notation analogous to
[Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[
\delta=
\alpha \min + (1-\alpha) \max=
\begin{cases}
d &\alpha=0,\\
d_1/2 &\alpha=1/2
\end{cases}
\]
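The interpolation can be illustrated numerically with the values reported above, \(d=306\) and \(d_1/2=252.5\). Reading the two cases as \(\max=d\) and \((\min+\max)/2=d_1/2\), one gets \(\min=d_1-d=505-306=199\). The following Python sketch only encodes that reading; the function and variable names are hypothetical.

```python
def delta(alpha: float, smaller: float, larger: float) -> float:
    """Return alpha * min + (1 - alpha) * max for the two given quantities."""
    return alpha * smaller + (1 - alpha) * larger


if __name__ == "__main__":
    d, half_d1 = 306, 252.5       # values reported for 5DKZ_1 vs 8TOU_1
    larger = d                    # alpha = 0 case: delta = max = d
    smaller = 2 * half_d1 - d     # from (min + max) / 2 = d_1 / 2, so min = 199
    print(delta(0.0, smaller, larger))  # 306.0
    print(delta(0.5, smaller, larger))  # 252.5
```

Intermediate values of \(\alpha\) move \(\delta\) linearly between the two tabulated quantities.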