Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\).
Let \(p_w(n)\) be the cardinality of \(P_w(n)\).
Let \(f(c)\) be the sequence in FASTA with 4-symbol Protein Data Bank code \(c\).
\(|P_{f(8DDW_1)}(2) \setminus P_{f(3PSE_1)}(2)|=230\),
\(|P_{f(3PSE_1)}(2) \setminus P_{f(8DDW_1)}(2)|=14\).
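The subword sets and their one-sided differences can be sketched in a few lines of Python. This is only an illustration: the helper names `subwords` and `subword_count` are hypothetical, and the two words below are toy strings, not the actual FASTA sequences.

```python
def subwords(w, n):
    """Return P_w(n): the set of distinct contiguous length-n subwords of w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def subword_count(w, n):
    """Return p_w(n), the cardinality of P_w(n)."""
    return len(subwords(w, n))


# Toy words standing in for f(c); they are NOT real PDB sequences.
x = "MKVLAAGMKV"
y = "AAGLLK"

only_x = subwords(x, 2) - subwords(y, 2)  # P_x(2) \ P_y(2)
only_y = subwords(y, 2) - subwords(x, 2)  # P_y(2) \ P_x(2)
print(len(only_x), len(only_y))
```

The two differences are generally of different sizes, which is why both are reported for the pair above.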
Let
\(
Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)|
\)
be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\), that is, the size of the symmetric difference of \(P_x(k)\) and \(P_y(k)\).
Hydrophobic-polar version of Sequence 1:
110010011010010000000100000000100101100100100001000001111000110001100110000010111000010000010111001110101011000000001000010000101000001010011101010111000011010101000101110110001010110111010111001010101001110111011100111110111001110011011000100001010011111111100000111001101000100110010110010001111001001001101010001000101001000110111111111011101101110010001111111001010100111110000001111000100011101000100000010011111100100001101101100100010111101110110101100101111100101100011101001111010011101111001011011100110100110100100100000110001001100100101110001011011111001111100000000010010001111001011011110001110010000000000101010010100111110011111111000011111100100111011110010011100100001100100010000001101110110000000001110110001001001001011111000011100000111001111010100001101111111110110101000001101001001010000100100100000000101011110001000000000010000011111001001001111011100110110111100111101001100001111001101110010011100110110010111000101001111111011111010001100010110010110101011011110001110111110111011011111111110111100111110001010110011011011101011100101101000000010010111000111111111100111101111011111100011010010001101000011101000111111111100101110010001000000000000110111000010010010000100010000001000000010100001001010100100000010101001010110100111011011001011001000010000000000110110000100001001010001011100010100101110100001001010001110010011000010100100
\[
\begin{array}{lrr}
\text{Pair} & Z_2 & \text{Length of longest common subsequence}\\
\hline
8DDW_1,\ 3PSE_1 & 244 & 4\\
8DDW_1,\ 6HZB_1 & 152 & 4\\
3PSE_1,\ 6HZB_1 & 186 & 4
\end{array}
\]
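A pairwise \(Z_k\) column like the one above could be produced along the following lines. This is a sketch under assumptions: `z` and `subwords` are hypothetical helper names, and the dictionary holds toy strings in place of the real sequences. For the first row, the two one-sided counts reported earlier add up as expected: \(230+14=244\).

```python
from itertools import combinations


def subwords(w, k):
    """Set of distinct contiguous length-k subwords of w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def z(x, y, k):
    """Z_k(x, y): size of the symmetric difference of P_x(k) and P_y(k)."""
    return len(subwords(x, k) ^ subwords(y, k))


# Hypothetical toy sequences; replace with the FASTA sequences of interest.
seqs = {"A": "MKVLAAGMKV", "B": "AAGLLK", "C": "MKVLLK"}

for (name_a, seq_a), (name_b, seq_b) in combinations(seqs.items(), 2):
    print(f"{name_a},{name_b}\t{z(seq_a, seq_b, 2)}")
```

Because the symmetric difference does not depend on argument order, each unordered pair needs to be computed only once.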
Newick tree
(3PSE_1:11.31,(8DDW_1:76,6HZB_1:76):41.31);
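As a rough consistency check, the branch lengths of the tree can be summed along the path between two leaves. The sketch below uses hypothetical helpers (`parse_newick`, `leaf_paths`, `patristic`) and assumes the tree is written with standard Newick parentheses and a terminating semicolon. It handles only simple trees, with no quoted labels or comments. The path between 8DDW_1 and 6HZB_1 has length \(76+76=152\), which matches the \(Z_2\) value listed for that pair.

```python
import re

SPECIAL = {"(", ")", ",", ":", ";"}


def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    tokens = re.findall(r"[(),:;]|[^(),:;\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1
            children.append(node())
            while tokens[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # skip the closing ")"
        name = ""
        if tokens[pos] not in SPECIAL:
            name = tokens[pos]
            pos += 1
        length = 0.0
        if tokens[pos] == ":":
            length = float(tokens[pos + 1])
            pos += 2
        return (name, length, children)

    return node()


def leaf_paths(tree, prefix=()):
    """Map each leaf name to the edges (node id, length) on its path from the root."""
    name, length, children = tree
    path = prefix + ((id(tree), length),)
    if not children:
        return {name: path}
    result = {}
    for child in children:
        result.update(leaf_paths(child, path))
    return result


def patristic(tree, a, b):
    """Sum of branch lengths on the path between leaves a and b."""
    paths = leaf_paths(tree)
    return sum(length for _, length in set(paths[a]) ^ set(paths[b]))


tree = parse_newick("(3PSE_1:11.31,(8DDW_1:76,6HZB_1:76):41.31);")
print(patristic(tree, "8DDW_1", "6HZB_1"))
```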
Let \(d\) and \(d_1\) be the Otu--Sayood distances of those names. (This makes the 4TYN sequence AAAAAA a close match...)
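For orientation, here is a minimal Python sketch of these two distances. It rests on assumptions not spelled out above: that \(d\) and \(d_1\) are the unnormalized Otu--Sayood quantities \(\max\{c(SQ)-c(S),\,c(QS)-c(Q)\}\) and \(c(SQ)-c(S)+c(QS)-c(Q)\), where \(c\) counts the components of the LZ76 exhaustive history. The function names are hypothetical, and the quadratic-time parser is meant for short strings only.

```python
def lz76_complexity(s):
    """Number of components in the LZ76 exhaustive history of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the candidate while it can be copied from the text that
        # precedes its own last symbol (overlapping copies are allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def otu_sayood(s, q, c=lz76_complexity):
    """Return (d, d1) under the assumed max / sum definitions."""
    a = c(s + q) - c(s)  # new components needed for q after having seen s
    b = c(q + s) - c(q)  # new components needed for s after having seen q
    return max(a, b), a + b


d, d1 = otu_sayood("0001101001000101", "0101010101")
print(d, d1 / 2)
```

A textbook check of the parser: the string `0001101001000101` splits into the six components 0, 001, 10, 100, 1000, 101.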
Roughly speaking, an expected distance is
\((0.85)(0.8)\left(\frac{1542}{\log_{20}1542}-\frac{171}{\log_{20}171}\right)\approx 360.\)
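The arithmetic in this estimate can be checked numerically. The function name and its parameter names below are hypothetical; the constants are taken directly from the expression above, without interpreting the two scaling factors.

```python
import math


def rough_expected_distance(n_long, n_short, alphabet=20, f1=0.85, f2=0.8):
    """Evaluate f1 * f2 * (n_long / log_A(n_long) - n_short / log_A(n_short))."""
    def n_over_log(n):
        return n / math.log(n, alphabet)

    return f1 * f2 * (n_over_log(n_long) - n_over_log(n_short))


value = rough_expected_distance(1542, 171)
print(round(value, 2))  # a little above 360
```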
\[
\begin{array}{lllrr}
\text{Status} & \text{Protein1} & \text{Protein2} & d & d_1/2\\
\hline
\text{Query variables} & 8DDW_1 & 3PSE_1 & 459 & 257
\end{array}
\]
In notation analogous to
[Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[
\delta=
\alpha \min + (1-\alpha) \max=
\begin{cases}
d &\alpha=0,\\
d_1/2 &\alpha=1/2
\end{cases}
\]
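A short worked step shows why the two cases come out as \(d\) and \(d_1/2\). It assumes, which is not stated above, that \(d\) is the maximum and \(d_1\) the sum of two one-sided quantities, written here with the hypothetical names \(a\) and \(b\):
\[
\delta\big|_{\alpha=0}=\max(a,b)=d,
\qquad
\delta\big|_{\alpha=1/2}=\tfrac12\min(a,b)+\tfrac12\max(a,b)=\frac{a+b}{2}=\frac{d_1}{2}.
\]
Under the same assumption, and if the tabulated values are exact, the 8DDW_1, 3PSE_1 row gives \(\max(a,b)=459\) and \(\min(a,b)=2\cdot 257-459=55\).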