Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\).
Let \(p_w(n)\) be the cardinality of \(P_w(n)\).
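As a minimal sketch of these two definitions, reading \(n\) as the subword length, \(P_w(n)\) and \(p_w(n)\) can be computed by sliding a window of length \(n\) over \(w\). The function names `subwords` and `subword_count` are illustrative and do not come from the source.

```python
def subwords(w: str, n: int) -> set:
    """Return P_w(n): the distinct length-n subwords (contiguous intervals) of w."""
    if n <= 0:
        return set()
    # range() is empty when n > len(w), so the result is the empty set in that case.
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def subword_count(w: str, n: int) -> int:
    """Return p_w(n), the cardinality of P_w(n)."""
    return len(subwords(w, n))


print(sorted(subwords("abab", 2)))  # ['ab', 'ba']
print(subword_count("abab", 2))     # 2
```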
Let \(f(c)\) be the sequence in FASTA with 4-symbol Protein Data Bank code \(c\).
For example, \(|P_{f(7KWO_1)}(2) \setminus P_{f(5PSI_1)}(2)|=248\) and
\(|P_{f(5PSI_1)}(2) \setminus P_{f(7KWO_1)}(2)|=3\).
Let
\(
Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)|
\)
be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).
Hydrophobic-polar version of Sequence 1:
10101000111011010101000001111010100100011011101011101100111000110000111010001101101011111111101010100011101001100110101111001010011000000000000000111110000110110001111001101000010010110010011111111000101100000010011111111001001000000011000011010111010010101000111111000001010111110010100111010011100000101010110110100111011011110010000001101010100010010101000001000000100001011010000010110100110001001100111000010011111110000000001001100110000010111000001000011000011111110101100111110001001001010110010110000110110010011111101100010101001100001001000000110100011011111111000001000100110000011110110000010100010011101111010010101001100101011001010101001101011011100011011101001000110000101111010011101001111111000001000110111010000000100000000010101100001101001000100001010011100110010001100001010011100110010001100001010011100001001011101110100000100001010011100110010001100001010011101110100000101110100000100001001011100001010011100001010011100001010011100110010001100110010001101110100000100001001011100001001011100110010001100001010011100001001011100010000100000010000010101000010100000000100100000001111100110011000101100010010110100111001001010011001010001111110101010001110100010010010001100000000110100011010000001101000111000010001110100101000100111111110000010110100101001111101100000101000100000110010100101000001011010110011111110000101011011000010010101011010000000111001011110010111001111010011100101110011110000000111110101001010101001011101101000101011000011011010111111101100011000100101001111001010010000100010111111010001100011011111001010100001000101011100100001111100011001010100010011101010010101010001101010010011010100010101100011001100101001110000010010111001010110100001011100101111000101010011001110101110010010000000110111011111011111101000111000101001110100001010101010110100100010000000000110110110001101000000100011111100010010101001010011100001000010100
1101101001110100010100000001111000101110001010000100101100011001100000000101011
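The quantity \(Z_k(x,y)\) defined above is the size of the symmetric difference of \(P_x(k)\) and \(P_y(k)\). The following sketch computes it for two short toy binary words; the function names `subwords` and `z` are illustrative, and the toy words are not the protein sequences discussed here.

```python
def subwords(w: str, k: int) -> set:
    """Distinct length-k subwords of w, i.e. P_w(k)."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def z(k: int, x: str, y: str) -> int:
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)


# P_2("0011") = {00, 01, 11} and P_2("0101") = {01, 10},
# so the two one-sided differences have sizes 2 and 1.
print(z(2, "0011", "0101"))  # 3
```

Since the two one-sided differences are disjoint, the same value is given by `len(px ^ py)`.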
Let \(d\) be the Otu--Sayood distance \(d\).
Let \(d_1\) be the Otu--Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
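For orientation, Otu and Sayood build their distances on the Lempel--Ziv (LZ76) complexity \(c\), the number of components in the exhaustive history of a sequence, with \(d(S,Q)=\max\{c(SQ)-c(S),\,c(QS)-c(Q)\}\) and \(d_1(S,Q)=c(SQ)-c(S)+c(QS)-c(Q)\). The sketch below assumes that these unnormalized forms are the \(d\) and \(d_1\) meant here. The function names are illustrative, and the simple quadratic-time parser is written for exposition rather than speed.

```python
def lz76_complexity(s: str) -> int:
    """Number of components in the LZ76 exhaustive history of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the candidate while it can be copied from the text that
        # precedes its own last symbol (overlapping copies are allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def otu_sayood_d(s: str, q: str) -> int:
    """Assumed form: d(S,Q) = max{c(SQ) - c(S), c(QS) - c(Q)}."""
    return max(lz76_complexity(s + q) - lz76_complexity(s),
               lz76_complexity(q + s) - lz76_complexity(q))


def otu_sayood_d1(s: str, q: str) -> int:
    """Assumed form: d1(S,Q) = c(SQ) - c(S) + c(QS) - c(Q)."""
    return (lz76_complexity(s + q) - lz76_complexity(s)
            + lz76_complexity(q + s) - lz76_complexity(q))


# Parsing 0001101001000101 gives 0 . 001 . 10 . 100 . 1000 . 101
print(lz76_complexity("0001101001000101"))  # 6
print(otu_sayood_d("AAAAAA", "AAAAAA"))     # 0
```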
Roughly speaking, an expected distance is \((0.85)(0.8)\left(\frac{2121}{\log_{20} 2121}-\frac{156}{\log_{20} 156}\right)\approx 501\).
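The arithmetic in this estimate can be checked numerically. In the sketch below the helper name `n_over_log` is illustrative, and the factors 0.85 and 0.8 are taken from the text as given, without interpretation.

```python
import math


def n_over_log(n: int, base: int = 20) -> float:
    """n / log_base(n), the term that appears twice in the estimate."""
    return n / math.log(n, base)


# Factors 0.85 and 0.8 and the numbers 2121 and 156 are copied from the text.
estimate = 0.85 * 0.8 * (n_over_log(2121) - n_over_log(156))
print(round(estimate))  # 501
```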
\[
\begin{array}{lllll}
\text{Status} & \text{Protein1} & \text{Protein2} & d & d_1/2\\
\hline
\text{Query variables} & 7KWO_1 & 5PSI_1 & 571 & 308.5
\end{array}
\]
In notation analogous to
[Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[
\delta=
\alpha \min + (1-\alpha) \max=
\begin{cases}
d &\alpha=0,\\
d_1/2 &\alpha=1/2
\end{cases}
\]
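To illustrate the interpolation with the values reported above for 7KWO_1 and 5PSI_1: the case \(\alpha=0\) gives \(\max=d=571\), and the case \(\alpha=1/2\) gives \((\min+\max)/2=d_1/2=308.5\), which forces \(\min=2(308.5)-571=46\). A minimal sketch, in which the function name `delta` and the variable names `lo` and `hi` are illustrative:

```python
def delta(alpha: float, lo: float, hi: float) -> float:
    """delta = alpha * min + (1 - alpha) * max."""
    return alpha * lo + (1 - alpha) * hi


hi = 571              # d, the alpha = 0 case
lo = 2 * 308.5 - hi   # recovered from d_1/2 = 308.5, the alpha = 1/2 case

print(lo)                   # 46.0
print(delta(0.0, lo, hi))   # 571.0
print(delta(0.5, lo, hi))   # 308.5
```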