CoV2D BrowserTM

Parikh vectors
4CVX_1 4LAL_1 7ZMH_1 Letter Amino acid
22 48 25 A Alanine
16 9 7 Q Glutamine
19 43 65 L Leucine
4 12 8 M Methionine
18 14 16 T Threonine
19 31 34 V Valine
11 23 36 S Serine
19 9 20 Y Tyrosine
4 3 3 C Cysteine
24 18 13 E Glutamic acid
35 26 27 G Glycine
13 19 41 I Isoleucine
6 17 28 F Phenylalanine
15 27 16 P Proline
22 17 9 R Arginine
7 8 11 N Asparagine
20 21 6 D Aspartic acid
14 14 1 H Histidine
11 14 6 K Lysine
11 3 6 W Tryptophan
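
A Parikh vector simply records how many times each letter occurs in a sequence. The following is a minimal Python sketch of that count, not the code behind this page; the names `AMINO_ACIDS` and `parikh_vector` are illustrative assumptions.

```python
from collections import Counter

# The 20 standard one-letter amino acid codes (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parikh_vector(sequence: str) -> dict:
    """Return the number of occurrences of each amino acid letter."""
    counts = Counter(sequence)
    return {letter: counts.get(letter, 0) for letter in AMINO_ACIDS}

# The last eight residues of the 4CVX_1 sequence, used as a small example.
tail = "LEHHHHHH"
vector = parikh_vector(tail)
print(vector["H"], vector["L"], vector["E"])
```

The entries of a Parikh vector always sum to the sequence length, which gives a quick consistency check on each column of the table.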

>4CVX_1|Chains A, D|MHC CLASS I ALPHA CHAIN 2|GALLUS GALLUS (9031)
>4LAL_1|Chains A, B, C, D|Uracil-5-carboxylate decarboxylase|Cordyceps militaris (983644)
>7ZMH_1|Chain A[auth 1]|NADH-ubiquinone oxidoreductase chain 1|Chaetomium thermophilum var. thermophilum DSM 1495 (759272)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
4CVX , Knot 132 310 0.81 40 193 294
ELHTLRYIRTAMTDPGPGLPWYVDVGYVDGELFVHYNSTARRYVPRTEWIAAKADQQYWDGQTQIGQGNEQIDRENLGILQRRYNQTGGSHTVQWMYGCDILEGGPIRGYYQMAYDGRDFTAFDKGTMTFTAAVPEAVPTKRKWEEGDYAEGLKQYLEETCVEWLRRYVEYGKAELGRRERPEVRVWGKEADGILTLSCRAHGFYPRPIVVSWLKDGAVRGQDAHSGGIVPNGDGTYHTWVTIDAQPGDGDKYQCRVEHASLPQPGLYSWEPRSGGGLNDIFEAQKIEWHENSSSVDKLAAALEHHHHHH
4LAL , Knot 162 376 0.85 40 211 359
MAASTPVVVDIHTHMYPPSYIAMLEKRQTIPLVRTFPQADEPRLILLSSELAALDAALADPAAKLPGRPLSTHFASLAQKMHFMDTNGIRVSVISLANPWFDFLAPDEAPGIADAVNAEFSDMCAQHVGRLFFFAALPLSAPVDAVKASIERVKNLKYCRGIILGTSGLGKGLDDPHLLPVFEAVADAKLLVFLHPHYGLPNEVYGPRSEEYGHVLPLALGFPMETTIAVARMYMAGVFDHVRNLQMLLAHSGGTLPFLAGRIESCIVHDGHLVKTGKVPKDRRTIWTVLKEQIYLDAVIYSEVGLQAAIASSGADRLMFGTAHPFFPPIEEDVQGPWDSSRLNAQAVIKAVGEGSSDAAAVMGLNAVRVLSLKAE
7ZMH , Knot 152 378 0.79 40 176 334
MSYSQTINSLVEVVLVLVPSLVGIAYVTVGERKTMGSMQRRLGPNAVGIYGLLQAFADALKLLLKEYVGPTQANLVLFFLGPVITLIFSLLGYAVIPYGPGLAVNDLSTGILYMLAVSSLATYGILLAGWSANSKYAFLGSLRSTAQLISYELVLSSSILLVIMLSGSLSLTVIVESQRAIWYILPLLPVFIIFFIGSVAETNRAPFDLAEAESELVSGFMTEHAAVIFVFFFLAEYGSIVLMCILTSILFLGGYLLISLLDIIYNNLLSWIVIGKYIIFIFPFWGPVFIDLGLYEIISYLYNAPTVEGSFYGLSLGVKTSILIFVFIWTRASFPRIRFDQLMSFCWTVLLPILFALIVLVPCILYSFNIFPVNISLL
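
The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be recomputed from the tabulated \(\mathrm{LZ}(w)\) and \(n\). The sketch below does that, and also includes a simple exhaustive-history phrase counter in the style of Lempel and Ziv (1976). Which LZ variant this page actually uses is not stated, so `lz76_phrase_count` is an assumption and is only checked on a toy binary word, not on the protein sequences above.

```python
import math

def lz76_phrase_count(w: str) -> int:
    """Number of phrases in an exhaustive-history (LZ76-style) parsing of w."""
    i, count, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Grow the phrase while it is still a copy of a subword that starts
        # earlier in w (the copy may overlap the phrase itself).
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def normalized_complexity(lz: int, n: int, alphabet_size: int = 20) -> float:
    """LZ(w) divided by n / log_{alphabet_size}(n)."""
    return lz / (n / math.log(n, alphabet_size))

# Toy word parsed as 0 | 001 | 10 | 100 | 1000 | 101.
toy = lz76_phrase_count("0001101001000101")

# (LZ(w), n) pairs copied from the table above.
ratios = [normalized_complexity(lz, n) for lz, n in [(132, 310), (162, 376), (152, 378)]]
print(toy, ratios)
```

The recomputed ratios agree with the table if the displayed values are truncated, rather than rounded, to two decimals.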

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
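
As an illustration of these definitions, here is a minimal Python sketch that enumerates the distinct length-\(n\) subwords of a word and counts them. The function names `subwords` and `subword_complexity` are illustrative assumptions, and the example uses only the first ten residues of the 4CVX_1 sequence.

```python
def subwords(w: str, n: int) -> set:
    """P_w(n): the set of distinct contiguous subwords of length n in w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def subword_complexity(w: str, n: int) -> int:
    """p_w(n): the number of distinct subwords of length n in w."""
    return len(subwords(w, n))

prefix = "ELHTLRYIRT"  # first ten residues of 4CVX_1
print(subword_complexity(prefix, 1), subword_complexity(prefix, 2))
```

Since a word of length \(m\) has only \(m-n+1\) windows of length \(n\), \(p_w(n)\) can never exceed \(m-n+1\).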

\(|P_{f(4CVX_1)}(2) \setminus P_{f(4LAL_1)}(2)|=78\), \(|P_{f(4LAL_1)}(2) \setminus P_{f(4CVX_1)}(2)|=96\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
0100100100110011111110101101010111000001000110001111010000101000110100010000111100000001100010110100110111101000110010010110010101011110111000010010010110001000010110001001010110000101011100101110100010110101111011001110100100111110101000011010101101000000100101101110010100111100110100101000000100111110000000

Pair \(Z_2\) Length of longest common subsequence
4CVX_1,4LAL_1 174 5
4CVX_1,7ZMH_1 171 4
4LAL_1,7ZMH_1 159 5
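
The quantity \(Z_k(x,y)\) is the size of the symmetric difference of the two subword sets, so the first row of the table is consistent with \(78+96=174\). A minimal Python sketch of the definition follows; the helper names are illustrative assumptions, and the example uses toy words, not the protein sequences.

```python
def subwords(w: str, k: int) -> set:
    """P_w(k): the set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x: str, y: str, k: int) -> int:
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

# Toy example: P_x(2) = {AB, BA}, P_y(2) = {BA, AB, BC}.
x, y = "ABAB", "BABC"
print(z_distance(x, y, 2))
```

Dividing \(Z_k(x,y)\) by \(|P_x(k)\cup P_y(k)|\) would give the usual Jaccard distance, which is why \(Z_k\) is described as its numerator.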

Newick tree

 
(4CVX_1:88.38,(7ZMH_1:79.5,4LAL_1:79.5):8.88);

Let d denote the Otu--Sayood distance d, and let d1 denote the Otu--Sayood distance d1. (The latter makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, the expected distance is \((0.85)(0.8)(\frac{686 }{\log_{20} 686}-\frac{310}{\log_{20}310})\approx 103\), where \(686=310+376\) is the combined length of the two sequences.
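
The arithmetic in this estimate can be checked directly. The sketch below is only a numerical check: the factors 0.85 and 0.8 are copied from the formula as given, and the helper name `baseline` is an illustrative assumption.

```python
import math

def baseline(n: int, alphabet_size: int = 20) -> float:
    """n / log_{alphabet_size}(n), the normalizing term used on this page."""
    return n / math.log(n, alphabet_size)

n1, n2 = 310, 376  # lengths of 4CVX_1 and 4LAL_1; 310 + 376 = 686
expected = 0.85 * 0.8 * (baseline(n1 + n2) - baseline(n1))
print(expected)
```

The result lies between 103 and 104, so the value 103 corresponds to dropping the fractional part.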
Status Protein1 Protein2 d d1/2
Query variables 4CVX_1 4LAL_1 132 120.5
Was not able to put for d
Was not able to put for d1

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
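
Reading the formula with \(\alpha=0\) gives \(d=\max\), and with \(\alpha=1/2\) gives \(d_1/2=(\min+\max)/2\). Under that reading, the two underlying terms can be recovered from the tabulated values \(d=132\) and \(d_1/2=120.5\). The following Python sketch does so; the variable names are illustrative assumptions.

```python
def delta(alpha: float, smaller: float, larger: float) -> float:
    """Interpolate between the smaller and larger of the two terms."""
    return alpha * smaller + (1 - alpha) * larger

d, d1_half = 132, 120.5       # values from the "Query variables" row
larger = d                    # alpha = 0 gives the max term
smaller = 2 * d1_half - d     # since d1 = min + max

print(smaller, delta(0, smaller, larger), delta(0.5, smaller, larger))
```

With \(\alpha=1\) the same function would return the min term alone, which here is 109.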