CoV2D Browser™

Parikh vectors
3KOQ_1 4PXE_1 3SDQ_1 Letter Amino acid
9 15 27 N Asparagine
17 31 70 E Glutamic acid
18 45 101 L Leucine
5 14 36 F Phenylalanine
7 11 39 Y Tyrosine
12 30 52 V Valine
12 26 49 D Aspartic acid
8 10 14 H Histidine
5 10 14 M Methionine
10 19 34 P Proline
8 16 36 T Threonine
2 1 15 W Tryptophan
10 16 43 R Arginine
3 2 15 C Cysteine
7 6 27 Q Glutamine
10 41 57 S Serine
9 44 54 A Alanine
9 33 39 G Glycine
15 30 46 I Isoleucine
15 30 49 K Lysine
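
A Parikh vector simply records how often each letter occurs in a sequence. The following is a minimal Python sketch, assuming one-letter amino-acid codes listed in the row order of the table above; the names `ALPHABET` and `parikh_vector` are illustrative and not taken from the CoV2D code.

```python
from collections import Counter

# One-letter codes in the row order of the table (an assumption of this sketch).
ALPHABET = "NELFYVDHMPTWRCQSAGIK"

def parikh_vector(sequence, alphabet=ALPHABET):
    """Return the number of occurrences of each alphabet letter in sequence."""
    counts = Counter(sequence)
    return [counts[letter] for letter in alphabet]

# First 20 residues of the 3KOQ_1 sequence, used only as a short example.
prefix = "MGSDKIHHHHHHENLYFQGM"
vector = parikh_vector(prefix)
print(dict(zip(ALPHABET, vector)))
```

The entries of a full Parikh vector add up to the sequence length; for instance, the 3KOQ_1 column above sums to 191.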

>3KOQ_1|Chains A, B, C, D|NITROREDUCTASE FAMILY PROTEIN|Clostridium difficile (272563)
>4PXE_1|Chains A, B|Ureidoglycolate hydrolase|Arabidopsis thaliana (3702)
>3SDQ_1|Chain A|Alpha-bisabolene synthase|Abies grandis (46611)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
3KOQ , Knot 93 191 0.85 40 148 185
MGSDKIHHHHHHENLYFQGMNFVELAKKRYSCRNYQDRKVEKEKLEKVLDVARIAPTGGNRQPQRLIVIQEKEGINKLSKAANIYDAPLAILVCGDKDKVWTRPFDGKQLTDIDTSIVTDHMMLQATELGLASVWVCYFNPDIIREEFSLPDNLEPINILLMGYESKIPESPERHEKTRVPLSEIVSYETL
4PXE , Knot 176 430 0.82 40 208 401
GHMFGSINLASSLSVDAPGLQNQIDELSSFSDAPSPSVTRVLYTDKDVSARRYVKNLMALAGLTVREDAVGNIFGKWDGLEPNLPAVATGSHIDAIPYSGKYDGVVGVLGAIEAINVLKRSGFKPKRSLEIILFTSEEPTRFGISCLGSRLLAGSKELAEALKTTVVDGQNVSFIEAARSAGYAEDKDDDLSSVFLKKGSYFAFLELHIEQGPILEDEGLDIGVVTAIAAPASLKVEFEGNGGHAGAVLMPYRNDAGLAAAELALAVEKHVLESESIDTVGTVGILELHPGAINSIPSKSHLEIDTRDIDEARRNTVIKKIQESANTIAKKRKVKLSEFKIVNQDPPALSDKLVIKKMAEAATELNLSHKMMISRAYHDSLFMARISPMGMIFIPCYKGYSHKPEEYSSPEDMANGVKVLSLTLAKLSLD
3SDQ , Knot 309 817 0.84 40 299 734
MAGVSAVSKVSSLVCDLSSTSGLIRRTANPHPNVWGYDLVHSLKSPYIDSSYRERAEVLVSEIKAMLNPAITGDGESMITPSAYDTAWVARVPAIDGSARPQFPQTVDWILKNQLKDGSWGIQSHFLLSDRLLATLSCVLVLLKWNVGDLQVEQGIEFIKSNLELVKDETDQDSLVTDFEIIFPSLLREAQSLRLGLPYDLPYIHLLQTKRQERLAKLSREEIYAVPSPLLYSLEGIQDIVEWERIMEVQSQDGSFLSSPASTACVFMHTGDAKCLEFLNSVMIKFGNFVPCLYPVDLLERLLIVDNIVRLGIYRHFEKEIKEALDYVYRHWNERGIGWGRLNPIADLETTALGFRLLRLHRYNVSPAIFDNFKDANGKFICSTGQFNKDVASMLNLYRASQLAFPGENILDEAKSFATKYLREALEKSETSSAWNNKQNLSQEIKYALKTSWHASVPRVEAKRYCQVYRPDYARIAKCVYKLPYVNNEKFLELGKLDFNIIQSIHQEEMKNVTSWFRDSGLPLFTFARERPLEFYFLVAAGTYEPQYAKCRFLFTKVACLQTVLDDMYDTYGTLDELKLFTEAVRRWDLSFTENLPDYMKLCYQIYYDIVHEVAWEAEKEQGRELVSFFRKGWEDYLLGYYEEAEWLAAEYVPTLDEYIKNGITSIGQRILLLSGVLIMDGQLLSQEALEKVDYPGRRVLTELNSLISRLADDTKTYKAEKARGELASSIECYMKDHPECTEEEALDHIYSILEPAVKELTREFLKPDDVPFACKKMLFEETRVTMVIFKDGDGFGVSKLEVKDHIKECLIEPLPL
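
The fourth column of the table divides \(\mathrm{LZ}(w)\) by \(n/\log_{20} n\). The Python sketch below pairs that normalization with one common way of counting LZ76 phrases, in which each phrase is the shortest prefix of the remaining text that has not occurred earlier. Whether the page counts phrases with exactly this variant is an assumption, and the function names are illustrative.

```python
import math

def lz76_complexity(w):
    """Count phrases in an LZ76-style parsing of w (one common variant)."""
    i, phrases, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Extend the phrase while it already occurs earlier (overlaps allowed).
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def normalized_lz(lz, n, alphabet_size=20):
    """Return LZ(w) / (n / log_alphabet_size(n))."""
    return lz / (n / math.log(n, alphabet_size))

# Classic binary example, parsed as 0|001|10|100|1000|101.
print(lz76_complexity("0001101001000101"))  # 6
# Values 93 and 191 are taken from the 3KOQ row; the result is about 0.85.
print(normalized_lz(93, 191))
```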

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\), and let \(p_w(n)=|P_w(n)|\) be its cardinality. Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
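
As a small illustration of these definitions, the set of subwords of a given length can be collected with a sliding window. This is a sketch in Python with illustrative names (`subwords`, `subword_complexity`), not the page's own implementation.

```python
def subwords(w, k):
    """Return the set of distinct length-k subwords (contiguous intervals) of w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_complexity(w, k):
    """Return the number of distinct length-k subwords of w."""
    return len(subwords(w, k))

word = "abab"
print(sorted(subwords(word, 2)))    # ['ab', 'ba']
print(subword_complexity(word, 3))  # 2, namely 'aba' and 'bab'
```

A word of length \(m\) has at most \(m-k+1\) distinct subwords of length \(k\), so a sequence of length 191 can have at most 190 subwords of length 2 and 189 of length 3.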

\(|P_{f(3KOQ_1)}(2) \setminus P_{f(4PXE_1)}(2)|=46\), \(|P_{f(4PXE_1)}(2) \setminus P_{f(3KOQ_1)}(2)|=106\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1 (3KOQ_1):
11000100000000101011011011000000000000010000100110110111011000100111100001100100110100111111101000011001101001001000110001110100111101110010101100010110010110111110000110010000000111001100001
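
The binary string above can be reproduced by mapping each residue to 1 (hydrophobic) or 0 (polar). The page does not state which residues count as hydrophobic; the set A, F, G, I, L, M, P, V, W used below is inferred by comparing the string with the 3KOQ_1 sequence, so it should be read as an assumption. The name `hp_encode` is illustrative.

```python
# Residues mapped to 1; inferred from the page's output, not stated by the source.
HYDROPHOBIC = set("AFGILMPVW")

def hp_encode(sequence, hydrophobic=HYDROPHOBIC):
    """Return the hydrophobic-polar (1/0) encoding of an amino-acid sequence."""
    return "".join("1" if residue in hydrophobic else "0" for residue in sequence)

# First 20 residues of 3KOQ_1; the output matches the start of the string above.
print(hp_encode("MGSDKIHHHHHHENLYFQGM"))  # 11000100000000101011
```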
Pair \(Z_2\) Length of longest common subword
3KOQ_1,4PXE_1 152 4
3KOQ_1,3SDQ_1 195 4
4PXE_1,3SDQ_1 157 5
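
Both columns of this table can be computed from sets of subwords. In the Python sketch below, \(Z_k\) is the size of the symmetric difference of the two length-\(k\) subword sets, and the last column is read as the length of the longest common contiguous subword (a gapped common subsequence would be far longer for sequences of this size). The function names are illustrative, and short toy words stand in for the protein sequences.

```python
def subwords(w, k):
    """Return the set of distinct length-k subwords of w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x, y, k):
    """Return |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    return len(subwords(x, k) ^ subwords(y, k))

def longest_common_subword_length(x, y):
    """Return the length of the longest contiguous subword shared by x and y."""
    k = 0
    # A common subword of length k + 1 implies common subwords of every shorter length.
    while k < min(len(x), len(y)) and subwords(x, k + 1) & subwords(y, k + 1):
        k += 1
    return k

print(z_distance("abab", "abba", 2))                  # 1, only 'bb' is unshared
print(longest_common_subword_length("abab", "abba"))  # 2, for example 'ab'
```

For the first row, the two one-sided counts reported above, 46 and 106, add up to the tabulated \(Z_2 = 152\).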

Newick tree

 
(3SDQ_1:92.30,(3KOQ_1:76,4PXE_1:76):16.30);

Let \(d\) denote the Otu–Sayood distance \(d\).
Let \(d_1\) denote the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{621}{\log_{20} 621}-\frac{191}{\log_{20} 191})\approx 122.\)
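
The arithmetic in this estimate can be checked directly. The Python sketch below takes the factors 0.85 and 0.8 as given and reads 621 as 191 + 430, the combined length of the two query sequences; that reading and the helper name `lz_scale` are assumptions of the sketch.

```python
import math

def lz_scale(n, alphabet_size=20):
    """Return n / log_alphabet_size(n), the normalizing quantity used on this page."""
    return n / math.log(n, alphabet_size)

n1, n2 = 191, 430  # lengths of 3KOQ_1 and 4PXE_1
expected = 0.85 * 0.8 * (lz_scale(n1 + n2) - lz_scale(n1))
print(round(expected, 1))    # about 122.6
print(math.floor(expected))  # 122, the value quoted on the page
```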
Status Protein1 Protein2 d d1/2
Query variables 3KOQ_1 4PXE_1 150 108

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \min + (1-\alpha) \max= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2. \end{cases} \]
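
To make the interpolation concrete, the Python sketch below evaluates \(\delta\) for the query pair, using \(d=150\) and \(d_1/2=108\) from the table above. It treats min and max as two numbers whose maximum is \(d\) and whose mean is \(d_1/2\); the smaller value, 66, is derived from those two entries and is not reported by the page. The names `delta`, `lo`, and `hi` are illustrative.

```python
def delta(lo, hi, alpha):
    """Return alpha * min + (1 - alpha) * max for the pair (lo, hi)."""
    return alpha * lo + (1 - alpha) * hi

d, half_d1 = 150, 108   # table values for 3KOQ_1 and 4PXE_1
hi = d                  # alpha = 0 gives the maximum
lo = 2 * half_d1 - hi   # derived so that (lo + hi) / 2 equals d_1 / 2

print(delta(lo, hi, 0))    # 150
print(delta(lo, hi, 0.5))  # 108.0
```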