Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\).
Let \(p_w(n)\) be the cardinality of \(P_w(n)\).
Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
For example, \(|P_{f(8VLX_1)}(2) \setminus P_{f(5TFC_1)}(2)|=229\) and
\(|P_{f(5TFC_1)}(2) \setminus P_{f(8VLX_1)}(2)|=6\).
Let
\(
Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)|
\)
be a LZ76 style (set of subwords) Jaccard distance numerator for \(P(k)\).Hydrophobic-polar version of Sequence 1:11010011011001001000000000000000000000000000000000000000000000000111111111110110111010111101011111111111111100110010001010000010001010001110010001010011111101111000010001011100010011011100011010101000100011100101111011011011010000101101110100000010001000111111011101101100001011101111010000101000111011010000000001001110111111111000000111111110100111110001000010101110000101010100110100101000000000110111011001100111011001011111101011000011000010110111111000011100000101111000110000000001000110101000101011100110011011001100010000010100101100010001001000011000000101110011101001001001100000000011001101000001110100000111011010000001011110010011000011100101100100000100001001110001001100000100101011000000011110010110101110110011110001010101111001111111010011001001110000010000100110010010101011011101011001100001011011101001010010110011110001000001000110011000110100000001110111011010000011100011001101010110110101001001100001110100011001110111000101001111011011101100000101011111100000101011100001100101001001001001110100101000100111110001100000110110001101100111101101110011111010000000001111011101100111110101000111111011110110010001100001011100000111111001111110011001101101010110011111110111101001101011000100001100101110100100101100000001110000000110100110010100110100100010101000000111110011011001101101001100100111010001000111101010011001110011001011000100001010011000101110000111100010011101010011010000000111011001000100010010000100011000101101111011000000001010001101110110101000110000111111100100101101000011110111111110000000001111101101001111010011001111101110011110100010110010000011101110110000110111111000000000010010001101111111000101000011111001101111001011011100111010011010010111011111101110000001110010010101011000110010010000010000010010011000100111011111100110001010100000010000110111011011001110010111001100010110100100101010011000111111100111110000001110100010
00010000110101010000001110111000011001111110001000100000101111001001101000111001101100001101111011000000100101100010010110100011110101001100110111011011100010111110100011011100100100010001110000010011001010010001010111000110101010100101000101011000010000011101101100111001011110001010111101011100101100011101100101101010100111100110101110111010010011101110001101101110011110011001011100000110111101011010110001110101011100001110111110110000110010011001011101111011001101000000101100000010100001001011001110110010011111000001111110111001110110111100000111111011101011101101110111011000011001100100111000001000110111111001111000001100000000101111011001110110111110111001000100011011000110010110111000101110000011000100110111010110011110000111010100011010001101010011110010110000100000001011110011001100000011101000001110100011110001000111110011001111001100000101101010010010100001110011110001111111001110110011000100001100111101110110001100010011111000110010111001010000011110101101100011011101010110101111010000010110001101100111000100101001101010010100100111111111001001000101100001011110000111110010111001001110010111011101100111100110011101100001010111011001100100010001100111101001000111111010100111010001111111101100110100101011011100100001000100011001101111110100011001001001000111001010100000000
| Pair | \(Z_2\) | Length of longest common subsequence |
|---|---|---|
| 8VLX_1, 5TFC_1 | 235 | 5 |
| 8VLX_1, 5JXM_1 | 213 | 4 |
| 5TFC_1, 5JXM_1 | 168 | 4 |
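The definitions of \(P_w(n)\), \(p_w(n)\) and \(Z_k\) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `subwords`, `subword_complexity` and `z` are hypothetical, and the two toy strings are made up for the example; they are not the PDB sequences.

```python
def subwords(w, n):
    """P_w(n): the set of distinct length-n subwords (contiguous intervals) of w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def subword_complexity(w, n):
    """p_w(n): the cardinality of P_w(n)."""
    return len(subwords(w, n))


def z(x, y, k):
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)


# Hypothetical toy sequences (NOT the PDB entries discussed in the text).
x = "MKVLAAGM"
y = "MKVAAG"

print(subword_complexity(x, 2))  # number of distinct 2-subwords of x
print(z(x, y, 2))                # size of the symmetric difference for k = 2
```

Applied to the actual FASTA sequences, `z(f_8vlx, f_5tfc, 2)` would be expected to reproduce the sum \(229+6=235\) reported for the pair 8VLX_1, 5TFC_1, provided the sequences are read exactly as in the source.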
Newick tree:
(8VLX_1:12.05,(5JXM_1:84,5TFC_1:84):36.05);
Let \(d\) be the Otu--Sayood distance \(d\), and let \(d_1\) be the Otu--Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
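For orientation, here is a minimal sketch of how such distances might be computed. It rests on assumptions that should be checked against Otu and Sayood's paper: that \(c(\cdot)\) is the LZ76 (exhaustive history) complexity, that \(d(x,y)=\max\{c(xy)-c(x),\,c(yx)-c(y)\}\), and that \(d_1(x,y)=(c(xy)-c(x))+(c(yx)-c(y))\). The function names are hypothetical, and the naive parsing below is quadratic or worse, so it is meant for illustration only.

```python
def lz76_complexity(s):
    """Number of components in the LZ76 exhaustive history of s (naive parsing)."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the component while s[i:i+length] already occurs in the
        # text seen so far, i.e. in s[:i+length-1] (overlap is allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def one_sided(x, y):
    """c(xy) - c(x): new components needed to describe y after x (assumed form)."""
    return lz76_complexity(x + y) - lz76_complexity(x)


def otu_sayood_d(x, y):
    """Assumed definition: the larger of the two one-sided quantities."""
    return max(one_sided(x, y), one_sided(y, x))


def otu_sayood_d1(x, y):
    """Assumed definition: the sum of the two one-sided quantities."""
    return one_sided(x, y) + one_sided(y, x)


s = "AACGTACCATTG"
print(lz76_complexity(s))                        # parses as A.AC.G.T.ACC.AT.TG
print(otu_sayood_d(s, s), otu_sayood_d1(s, s))   # distance of s to itself
```

Under these assumed definitions neither distance is normalized by sequence length, so a very short, low-complexity sequence contributes only a few components in either direction.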
Roughly speaking, the expected distance is \((0.85)(0.8)\left(\frac{3416}{\log_{20} 3416}-\frac{229}{\log_{20} 229}\right)\approx 769\).
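The arithmetic in this estimate can be checked directly. The sketch below simply evaluates the expression as written; the function name `expected_distance` and its parameter names are hypothetical, and the factors 0.85 and 0.8 are taken from the text without further interpretation.

```python
import math


def expected_distance(n_long, n_short, alphabet=20, factor_a=0.85, factor_b=0.8):
    """Evaluate factor_a * factor_b * (n_long/log_k(n_long) - n_short/log_k(n_short))."""
    def scaled(n):
        # n / log_k(n) for an alphabet of size k
        return n / math.log(n, alphabet)
    return factor_a * factor_b * (scaled(n_long) - scaled(n_short))


value = expected_distance(3416, 229)
print(round(value))  # rounds to 769
```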
| Status | Protein1 | Protein2 | \(d\) | \(d_1/2\) |
|---|---|---|---|---|
| Query variables | 8VLX_1 | 5TFC_1 | 949 | 506.5 |
Was not able to put for d. Was not able to put for d1.
In notation analogous to
[Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[
\delta=
\alpha \mathrm{min} + (1-\alpha) \mathrm{max}=
\begin{cases}
d &\alpha=0,\\
d_1/2 &\alpha=1/2
\end{cases}
\]
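As a consistency check using only the values reported for the pair 8VLX_1, 5TFC_1 (\(d=949\), \(d_1/2=506.5\)), and assuming that \(\min\) and \(\max\) denote the smaller and larger of the two quantities being combined, the two cases determine both of them:
\[
\max = d = 949,
\qquad
\tfrac12(\min+\max)=\tfrac{d_1}{2}=506.5
\;\Longrightarrow\;
\min = 2(506.5)-949 = 64 .
\]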