CoV2D Browser™

Parikh vectors
7TCC_1 4OIT_1 8ZAA_1 Letter Amino acid
10 1 4 M Methionine
63 2 13 P Proline
91 10 15 V Valine
83 5 9 N Asparagine
63 6 8 Q Glutamine
52 5 20 E Glutamic acid
73 0 9 I Isoleucine
64 8 16 K Lysine
12 3 8 W Tryptophan
53 4 8 Y Tyrosine
30 0 4 C Cysteine
91 12 12 G Glycine
27 7 6 H Histidine
102 13 20 L Leucine
79 1 9 F Phenylalanine
61 10 11 D Aspartic acid
92 10 15 T Threonine
81 8 11 A Alanine
44 4 15 R Arginine
101 4 14 S Serine

>7TCC_1|Chains A, B, C|Spike glycoprotein|Severe acute respiratory syndrome coronavirus 2 (2697049)
>4OIT_1|Chains A, B, C, D|LysM domain protein|Mycobacterium smegmatis (246196)
>8ZAA_1|Chains A[auth C], B[auth D]|Butyrophilin subfamily 3 member A1|Homo sapiens (9606)
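
The counts in the Parikh vector table can be recomputed from a sequence. A minimal Python sketch, assuming the Parikh vector is simply the per-letter count over the 20 standard amino acids; the name `parikh_vector` and the alphabetical letter order are illustrative choices, not taken from the page. The sequence is the 4OIT_1 entry listed on this page.

```python
from collections import Counter

# The 20 standard amino-acid letters (alphabetical; the page lists them in another order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parikh_vector(seq: str) -> dict[str, int]:
    """Return the number of occurrences of each amino-acid letter in seq."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) for aa in AMINO_ACIDS}

# 4OIT_1 sequence as listed on this page (split in two only for readability).
SEQ_4OIT = (
    "MGDTLTAGQKLERGGSLQSGNGAYTLTLQDDGNLVLYARDKAVWSTGTNGQDVVRAEVQTDGNF"
    "VLYTAEKPVWHTDTKGKKEVKLVLQDDRNLVLYAKDGPAWSLEHHHHHH"
)

vec = parikh_vector(SEQ_4OIT)
for letter in "MPWH":
    print(letter, vec[letter])
```

The entries of a Parikh vector always sum to the sequence length, which gives a quick consistency check against the Length column.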
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
7TCC , Knot 450 1272 0.84 40 331 1101
QCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHVISGTNGTKRFDNPVLPFNDGVYFASIEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLDHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPIIVREPEDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFDEVFNATRFASVYAWNRKRISNCVADYSVLYNLAPFFTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGNIADYNYKLPDDFTGCVIAWNSNKLDSKVSGNYNYLYRLFRKSNLKPFERDISTEIYQAGNKPCNGVAGFNCYFPLRSYSFRPTYGVGHQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLKGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECDIPIGAGICASYQTQTKSHGSASSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLKRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKYFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFKGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNHNAQALNTLVKQLSSKFGAISSVLNDIFSRLDPPEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQGSGYIPEAPRDGQAYVRKDGEWVLLSTFLGRSLEVLFQGPGHHHHHHHHSAWSHPQFEKGGGSGGGGSGGSAWSHPQFEK
4OIT , Knot 52 113 0.72 36 74 98
MGDTLTAGQKLERGGSLQSGNGAYTLTLQDDGNLVLYARDKAVWSTGTNGQDVVRAEVQTDGNFVLYTAEKPVWHTDTKGKKEVKLVLQDDRNLVLYAKDGPAWSLEHHHHHH
8ZAA , Knot 108 227 0.86 40 167 221
EQELREMAWSTMKQEQSTRVKLLEELRWRSIQYASRGERHSAYNEWKKALFKPADVILDPKTANPILLVSEDQRSVQRAKEPQDLPDNPERFNWHYCVLGCESFISGRHYWEVEVGDRKEWHIGVCSKNVQRKGWVKMTPENGFWTMGLTDGNKYRTLTEPRTNLKLPKTPKKVGVFLDYETGDISFYNAVDGSHIHTFLDVSFSEALYPVFRILTLEPTALTICPA
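
The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be checked from the tabulated LZ-complexity and length. The sketch below also includes one common way to count Lempel-Ziv (1976) phrases; whether the page uses exactly this parsing variant is an assumption, and the function names are illustrative.

```python
import math

def lz76_complexity(w: str) -> int:
    """Number of phrases in an exhaustive-history LZ76-style parsing of w.

    Assumption: this may differ in boundary details from the variant used on the page.
    """
    i, phrases, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Grow the phrase while it already occurs in the text before its last symbol.
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def normalized_lz(lz: int, n: int, alphabet_size: int = 20) -> float:
    """LZ(w) divided by n / log_{alphabet_size}(n)."""
    return lz / (n / math.log(n, alphabet_size))

# Classic textbook example: parses as 0|001|10|100|1000|101.
print(lz76_complexity("0001101001000101"))

# (code, LZ-complexity, length) taken from the table above.
for code, lz, n in [("7TCC", 450, 1272), ("4OIT", 52, 113), ("8ZAA", 108, 227)]:
    print(code, normalized_lz(lz, n))
```

The three ratios computed this way agree with the tabulated 0.84, 0.72 and 0.86 to within 0.01.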

Let \(P_w(k)\) be the set of distinct subwords (intervals) of length \(k\) in a word \(w\). Let \(p_w(k)\) be the cardinality of \(P_w(k)\). Let \(f(c)\) be the sequence in FASTA with 4-symbol Protein Data Bank code \(c\).
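
The subword set and its cardinality translate into a few lines of Python. This is a sketch of the definitions only; the names `subwords` and `subword_complexity` are illustrative, and "subword" is taken to mean a contiguous block of a fixed length.

```python
def subwords(w: str, length: int) -> set[str]:
    """Set of distinct contiguous subwords of the given length in w."""
    return {w[i:i + length] for i in range(len(w) - length + 1)}

def subword_complexity(w: str, length: int) -> int:
    """Cardinality of the subword set, i.e. the number of distinct subwords."""
    return len(subwords(w, length))

word = "abab"
print(sorted(subwords(word, 2)))    # ['ab', 'ba']
print(subword_complexity(word, 3))  # 2
```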

\(|P_{f(7TCC_1)}(2) \setminus P_{f(4OIT_1)}(2)|=261\), \(|P_{f(4OIT_1)}(2) \setminus P_{f(7TCC_1)}(2)|=4\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
001010000011110000100110010011000110000011111100101101101001000100111110011011010000110111110010000001111001001110100101000111000000011000101000100001001001111010100101001001110010101010000011110010011011011011101111101001001111000010110000110111110011010100111000001010011001101100000010010100110000010101000110110100101100110100110101100001000110001100111110100011010010010100101001110100100111100101100000110010101111000010001010000100110000101100010001001100100111110001110000101001110010011110101101110101100000110000101010110101110000001111001100110000110010010110101001111011011000000111100110000111110100101010100010011000110111100100000001111111010000000001010011000111001011100011000001111001010100011110100001000101010000000111001010001001101111000000001110100100011100111101001110100100001100111001011011110001001101110011010010110111111000111000011111010011011111110111110110010111100011000001110010011101000100010111010011000101100110010001111001100110000110101010011010100100010001101101010101110010001110000101010100110110011011111010011100001001111000101011001111001001110000100101100000110100011111100010011010100100010001000001010110101101011010001001001100100011010011000010101101100101010001011110011100101110111000000000110010100111011110110110010100
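
The hydrophobic-polar string can be reproduced with a two-class letter mapping. The page does not state its classification; the set below was inferred by aligning the first residues of the 7TCC_1 sequence with the printed bits, so treat it as an assumption. The name `hp_encode` is illustrative.

```python
# Assumption: letters mapped to 1 (hydrophobic), inferred from the printed bit string.
# Every other letter, including non-standard ones such as X, is mapped to 0 (polar).
HYDROPHOBIC = set("AFGILMPVW")

def hp_encode(seq: str) -> str:
    """Encode a protein sequence as a binary hydrophobic (1) / polar (0) word."""
    return "".join("1" if aa in HYDROPHOBIC else "0" for aa in seq)

# First 30 residues of the 7TCC_1 sequence listed on this page.
prefix = "QCVNLTTRTQLPPAYTNSFTRGVYYPDKVF"
print(hp_encode(prefix))  # 001010000011110000100110010011
```

The output matches the first 30 symbols of the printed hydrophobic-polar string.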
Pair \(Z_2\) Length of longest common subword (substring)
7TCC_1,4OIT_1 265 6
7TCC_1,8ZAA_1 190 4
4OIT_1,8ZAA_1 151 4
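
Both columns of this table can be computed with short routines. The sketch below assumes the last column is the length of the longest common contiguous subword; this reading fits the first row, since 7TCC_1 and 4OIT_1 share the His-tag HHHHHH of length 6. Function names are illustrative.

```python
def subwords(w: str, k: int) -> set[str]:
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x: str, y: str, k: int) -> int:
    """Z_k(x, y): size of the symmetric difference of the length-k subword sets."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def longest_common_substring_length(x: str, y: str) -> int:
    """Length of the longest contiguous block shared by x and y (O(|x||y|) DP)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the common block ending at the previous positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

print(z_distance("abab", "abba", 2))                               # 1
print(longest_common_substring_length("xxHHHHHHHHyy", "LEHHHHHH"))  # 6
```

Note that \(Z_2 = 265 = 261 + 4\) for the first pair, in agreement with the two set differences stated above.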

Newick tree

 
[
	7TCC_1:12.78,
	[
		8ZAA_1:75.5,4OIT_1:75.5
	]:50.28
]
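
The bracketed tree above can be written in standard parenthesized Newick notation. A minimal sketch, assuming each node is a pair of a leaf name (or a list of children) and a branch length; this nested-tuple representation and the name `to_newick` are hypothetical, chosen only for illustration.

```python
def to_newick(node) -> str:
    """Serialize (name_or_children, branch_length) recursively; length None means no length."""
    name_or_children, length = node
    if isinstance(name_or_children, str):
        text = name_or_children
    else:
        text = "(" + ",".join(to_newick(child) for child in name_or_children) + ")"
    return text if length is None else f"{text}:{length}"

# Tree and branch lengths as displayed on this page.
tree = ([("7TCC_1", 12.78), ([("8ZAA_1", 75.5), ("4OIT_1", 75.5)], 50.28)], None)
newick = to_newick(tree) + ";"
print(newick)  # (7TCC_1:12.78,(8ZAA_1:75.5,4OIT_1:75.5):50.28);
```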

Let \(d\) be the Otu–Sayood distance \(d\).
Let \(d_1\) be the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{1385 }{\log_{20} 1385}-\frac{113}{\log_{20}113})\approx 341.\)
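
The arithmetic in this estimate can be checked numerically. In the sketch below the constants 0.85 and 0.8 and the lengths 1385 and 113 are taken from the formula as printed; 1385 = 1272 + 113, presumably the length of the two sequences concatenated. The helper name `lz_scale` is illustrative.

```python
import math

def lz_scale(n: int, alphabet_size: int = 20) -> float:
    """n / log_{alphabet_size}(n), the scale used to normalize LZ-complexity."""
    return n / math.log(n, alphabet_size)

estimate = 0.85 * 0.8 * (lz_scale(1385) - lz_scale(113))
print(round(estimate))  # 341
```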
Status Protein1 Protein2 d d1/2
Query variables 7TCC_1 4OIT_1 432 232.5

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
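
The interpolation can be made concrete in a short sketch. It assumes the usual Otu–Sayood construction, in which min and max are taken over the two complexity increments \(c(xy)-c(x)\) and \(c(yx)-c(y)\), so that \(d=\max\) and \(d_1=\min+\max\); the function names and the toy complexities are illustrative, not taken from the page.

```python
def delta(a: float, b: float, alpha: float) -> float:
    """alpha * min(a, b) + (1 - alpha) * max(a, b)."""
    return alpha * min(a, b) + (1 - alpha) * max(a, b)

def otu_sayood(c_x: int, c_y: int, c_xy: int, c_yx: int) -> tuple[float, float]:
    """Return (d, d1) from the complexities of x, y and both concatenations."""
    a = c_xy - c_x  # extra phrases needed for y once x is known
    b = c_yx - c_y  # extra phrases needed for x once y is known
    return delta(a, b, 0.0), 2 * delta(a, b, 0.5)

# Toy complexities, for illustration only.
print(otu_sayood(10, 4, 12, 11))  # (7.0, 9.0)

# The table row d = 432, d1/2 = 232.5 implies increments 432 and 2 * 232.5 - 432 = 33.
print(delta(432, 33, 0.0), delta(432, 33, 0.5))  # 432.0 232.5
```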