Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\).
Let \(p_w(n)\) be the cardinality of \(P_w(n)\).
Let \(f(c)\) be the sequence in FASTA with 4-symbol Protein Data Bank code \(c\).
\(|P_{f(8YQV_1)}(2) \setminus P_{f(5IOY_1)}(2)|=221\),
\(|P_{f(5IOY_1)}(2) \setminus P_{f(8YQV_1)}(2)|=9\).
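A minimal Python sketch of these definitions follows. The function name `subwords` and the toy words are illustrative assumptions, not part of the source; the actual FASTA sequences are not reproduced here.

```python
def subwords(w: str, n: int) -> set[str]:
    """Return P_w(n): the set of distinct length-n subwords (intervals) of w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}

# Toy words standing in for f(8YQV_1) and f(5IOY_1) (hypothetical data).
x, y = "MKVLAAGMK", "MKVLG"

only_in_x = subwords(x, 2) - subwords(y, 2)  # P_x(2) \ P_y(2)
only_in_y = subwords(y, 2) - subwords(x, 2)  # P_y(2) \ P_x(2)
print(len(subwords(x, 2)), len(only_in_x), len(only_in_y))
```

Here `len(subwords(w, n))` plays the role of \(p_w(n)\), and the two set differences correspond to the two counts reported above.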
Let
\(
Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)|
\)
be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).
Hydrophobic-polar version of Sequence 1:
1011010111101011100000001110101001101011101110010110000000010000000001101111010111101111101001101101001111101000001101001101100000100001001101011000000101110001010010101100110010000110110000001001110110111101011101111011001001001100110001111001011010011101000100100100011100100010011010011111101110110011000101000111001101000010100010100110110110010110010000100111011010000110001000100010010110001010110110001101011110001010000111001111001010010101010110010101001011111011001010110010011100000111010100001101110000011100110010111110000001101100010011010011011100011000011010001011010000001000100101101110001111100111001100001100110111100011100100111010010111010100010011001110000100011010111111100001000101011011001101110010100011101110110100101101111110101000010101011001100101110101011100001111001011110101010110011000001010001111100011000001010001100101001101001001010011100001000100011001110001001000000000111010010100110010011101101100111000011111000011000111001000110111001000100111101001101101110101101001010000101110110100000110010111111100100110001100000011110000111010011010110100000111010010100000010011001011010011101011000000000011010110010110011000111011001101010101000011100101001100101001000110010001011111101010001100000001100001110110011000101111100101101100010101011010010110001001111110001010011000110010101101100011001001110011000011101011000101001001110100100111011100110110011100110110111110111011011010001110000100000010011011
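As a sketch of the definition of \(Z_k\): it is the size of the symmetric difference of the two subword sets, so the two one-sided counts reported earlier add up to it, \(221+9=230\). The helper names `subwords` and `Z` and the toy inputs are assumptions made for illustration.

```python
def subwords(w, k):
    """P_w(k): the set of distinct length-k subwords of w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def Z(k, x, y):
    """Size of the symmetric difference of P_x(k) and P_y(k)."""
    return len(subwords(x, k) ^ subwords(y, k))

# Toy example (hypothetical words, not the protein sequences).
print(Z(2, "abab", "abb"))

# Consistency check on the counts quoted in the text.
assert 221 + 9 == 230
```

Dividing `Z(k, x, y)` by the size of the union `subwords(x, k) | subwords(y, k)` would give the full Jaccard distance; only the numerator is used here.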
| Pair | \(Z_2\) | Length of longest common subsequence |
|---|---|---|
| 8YQV_1, 5IOY_1 | 230 | 4 |
| 8YQV_1, 2XNC_1 | 186 | 4 |
| 5IOY_1, 2XNC_1 | 184 | 3 |
Newick tree:
(8YQV_1:10.44,(2XNC_1:92,5IOY_1:92):16.44);
Let \(d\) and \(d_1\) be the Otu--Sayood distances \(d\) and \(d_1\), respectively. (This makes the 4TYN sequence AAAAAA a close match...)
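For orientation, here is a sketch of how such distances can be computed, assuming the definitions of Otu and Sayood (2003): with \(c\) the LZ76 complexity (the number of components in the exhaustive history), \(d(S,Q)=\max\{c(SQ)-c(S),\,c(QS)-c(Q)\}\) and \(d_1(S,Q)=c(SQ)-c(S)+c(QS)-c(Q)\). The function names and the simple quadratic-time search are choices made for this illustration, not necessarily what produced the numbers on this page.

```python
def lz76_complexity(s: str) -> int:
    """Number of components in the LZ76 exhaustive history of s."""
    n, i, c = len(s), 0, 0
    while i < n:
        l = 1
        # Extend the component while s[i:i+l] already occurs starting
        # at some position before i (overlap with itself is allowed).
        while i + l <= n and s.find(s[i:i + l], 0, i + l - 1) != -1:
            l += 1
        c += 1   # one more component (the last one may be incomplete)
        i += l
    return c

def otu_sayood(s: str, q: str) -> tuple[int, int]:
    """Return (d, d1) under the definitions assumed in the lead-in."""
    a = lz76_complexity(s + q) - lz76_complexity(s)  # c(SQ) - c(S)
    b = lz76_complexity(q + s) - lz76_complexity(q)  # c(QS) - c(Q)
    return max(a, b), a + b

# Classic test word: 0 . 001 . 10 . 100 . 1000 . 101 has 6 components.
assert lz76_complexity("0001101001000101") == 6
print(otu_sayood("0001101001000101", "0101"))
```

Both quantities are unnormalized counts of extra components, which is why values in the hundreds can arise for sequences with hundreds or thousands of symbols.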
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{1666}{\log_{20} 1666}-\frac{216}{\log_{20} 216}\right)\approx 375\).
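The arithmetic can be checked with a few lines of Python. The factors 0.85 and 0.8 are taken from the text as given; their meaning is not stated there. The helper name `random_word_estimate` is an assumption, and reading \(n/\log_{20} n\) as the usual asymptotic estimate for the LZ76 complexity of a random length-\(n\) word over a 20-letter alphabet is an interpretation, not something the source states.

```python
import math

def random_word_estimate(n: int, alphabet_size: int = 20) -> float:
    """n / log_k(n), with k the alphabet size (20 amino acids here)."""
    return n / math.log(n, alphabet_size)

# 1666 and 216 are the two numbers appearing in the formula above.
estimate = 0.85 * 0.8 * (random_word_estimate(1666) - random_word_estimate(216))
print(round(estimate, 1))
```

The result lies between 375 and 376, consistent with the rounded value quoted in the text.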
| Status | Protein1 | Protein2 | \(d\) | \(d_1/2\) |
|---|---|---|---|---|
| Query variables | 8YQV_1 | 5IOY_1 | 487 | 276 |
In notation analogous to
[Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)], we have
\[
\delta=
\alpha \min + (1-\alpha) \max=
\begin{cases}
d &\alpha=0,\\
d_1/2 &\alpha=1/2.
\end{cases}
\]
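As a worked check of the two cases using only the tabulated values \(d=487\) and \(d_1/2=276\): \(\alpha=0\) gives \(\max=487\), and \(\alpha=1/2\) gives \((\min+\max)/2=276\), hence \(\min=65\). The function name `delta` and the variable names are assumptions for this sketch.

```python
def delta(alpha: float, a: float, b: float) -> float:
    """Interpolate between min(a, b) (alpha = 1) and max(a, b) (alpha = 0)."""
    return alpha * min(a, b) + (1 - alpha) * max(a, b)

d, half_d1 = 487, 276          # values from the table for 8YQV_1, 5IOY_1
largest = d                    # alpha = 0 case: delta equals max
smallest = 2 * half_d1 - d     # from (min + max) / 2 = d1 / 2

assert smallest == 65
assert delta(0, smallest, largest) == d
assert delta(0.5, smallest, largest) == half_d1
```

Intermediate values of \(\alpha\) then interpolate linearly between the two tabulated distances.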